Archive for the ‘Python’ Category

The Python module for file type identification, called ‘magic’, is not standardized

Sunday, March 3rd, 2013

I found out the hard way that the API exported by the Python module ‘magic’ differs between versions of the module.

The version installed by the Debian package ‘python-magic’ provides the following API:

import magic
mymagic = magic.open(magic.MAGIC_MIME_TYPE)
mymagic.load()
mtype = mymagic.file(inpfname)
print("The MIME type of the file %s is %s" % (inpfname, mtype))

The version installed using ‘pip install python-magic’ provides the following API:

import magic
mymagic = magic.Magic(mime=True)
mtype = mymagic.from_file(inpfname)
print("The MIME type of the file %s is %s" % (inpfname, mtype))

The following code allows the rest of the script to work the same way with either version of ‘magic’:

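import magic

def build_magic():
    """Return an object whose file(path) method yields a MIME type."""
    try:
        # API of the Debian package 'python-magic'
        mymagic = magic.open(magic.MAGIC_MIME_TYPE)
        mymagic.load()
    except AttributeError:
        # API of the module installed by 'pip install python-magic'
        mymagic = magic.Magic(mime=True)
        mymagic.file = mymagic.from_file
    return mymagic

mymagic = build_magic()
mtype = mymagic.file(inpfname)
print("The MIME type of the file %s is %s" % (inpfname, mtype))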
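A variation on the same idea, as a minimal sketch: instead of attaching a file attribute to the Magic instance, return a plain function that maps a path to a MIME type. The helper name make_mime_detector is hypothetical, and passing the module in as a parameter is my own choice (it makes the helper easy to try out with a stand-in module). Only the calls already shown above are used.

```python
def make_mime_detector(magic_module):
    """Return a function path -> MIME type for either flavour of 'magic'.

    make_mime_detector is a hypothetical helper name. It relies only on
    the two APIs shown in this post.
    """
    if hasattr(magic_module, "open") and hasattr(magic_module, "MAGIC_MIME_TYPE"):
        # API of the Debian package: open a cookie, load the database.
        cookie = magic_module.open(magic_module.MAGIC_MIME_TYPE)
        cookie.load()
        return cookie.file
    # API of the pip-installed module.
    detector = magic_module.Magic(mime=True)
    return detector.from_file
```

Usage would be along the lines of detect = make_mime_detector(magic) followed by mtype = detect(inpfname).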

PyGuile - Part 5 - Python objects (PyObjects) as proxies for Guile objects (SCMs)

Thursday, October 9th, 2008

An essential part of integration of Scheme (as implemented in Guile) and Python is allowing Python code to call back code implemented in Scheme. It is also desirable to be able to access data and invoke methods on otherwise-opaque objects created and managed in Scheme. The specifics of opaque object access should also be independent of the specific object system being used in the Scheme-written part of the application.

When implementing the proxying scheme in PyGuile, in which PyObjects serve as proxies for SCMs, the following need to be taken into account:

  • Data type conversion
  • Retention of references as a protection against garbage collection

Data type conversions

A Python callback can be expected to receive positional and keyword arguments, and return a result of any type. Therefore, templates (possibly trivial) for converting between PyObjects (Python data types) and SCMs (Guile data types) need to be associated with each callback.

In the case of objects, we need to associate with each attribute getter a template for converting the value from SCM into PyObject, and with each attribute setter a template for converting the value from PyObject into SCM. Each method that can be invoked on the object needs at least two templates (one for its arguments and one for its result). Three templates should be provided if the object is to be manipulated through an interface that expects both positional and keyword arguments in the object’s methods.

All the templates needed to work with an SCM (as a callback or as an object) are associated with it when it is wrapped on its way from Guile to Python.
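To make the idea of attaching templates at wrapping time concrete, here is a minimal sketch in pure Python. It is not PyGuile code: the Scheme procedure is simulated by an ordinary Python callable, each template is just a list or dict of conversion functions, and the class name WrappedCallback is hypothetical.

```python
class WrappedCallback:
    """Python-side proxy for a Scheme procedure, with its templates attached.

    Assumptions of this sketch: scheme_proc is a Python callable standing
    in for an SCM, and templates are plain conversion functions.
    """

    def __init__(self, scheme_proc, arg_template, kwarg_template, result_template):
        self.scheme_proc = scheme_proc
        self.arg_template = arg_template        # one converter per positional argument
        self.kwarg_template = kwarg_template    # converter per keyword name
        self.result_template = result_template  # converter for the result

    def __call__(self, *args, **kwargs):
        if len(args) != len(self.arg_template):
            raise TypeError("expected %d positional arguments" % len(self.arg_template))
        # Python -> "Scheme" for the arguments ...
        scm_args = [conv(arg) for conv, arg in zip(self.arg_template, args)]
        scm_kwargs = {name: self.kwarg_template[name](value)
                      for name, value in kwargs.items()}
        # ... and "Scheme" -> Python for the result.
        return self.result_template(self.scheme_proc(*scm_args, **scm_kwargs))


def scale_all(items, scale=1):
    """Stand-in for a procedure implemented in Scheme."""
    return [item * scale for item in items]

wrapped = WrappedCallback(scale_all, [list], {"scale": int}, tuple)
scaled = wrapped((1, 2, 3), scale="2")
```

The three templates here correspond to the positional arguments, the keyword arguments and the result.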

Retention of references

PyObjects, which wrap SCMs, are not expected to be seen by Guile’s garbage collector. Therefore, we need a mechanism for protecting SCMs referenced by PyObjects.

For efficiency reasons, Guile’s scm_permanent_object, scm_gc_protect_object or scm_gc_unprotect_object should not be called on every SCM passed to Python. The solution is to create a set object in Guile, protect it using scm_permanent_object (a single call) and then register in it all wrapped SCMs. When a wrapping PyObject’s __del__ function is invoked, one of its actions is to remove the corresponding SCM object from the set. The set will be implemented using a standard hash table, whose keys will be indexes and whose values will be the SCMs themselves.
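The bookkeeping can be sketched in pure Python as follows. This is only a model of the scheme described above: in PyGuile the table would be a Guile hash table protected once with scm_permanent_object, whereas here a Python dict plays that role. The class names ProtectedRegistry and ScmProxy are hypothetical.

```python
import itertools


class ProtectedRegistry:
    """Model of the single protected table: index -> wrapped SCM."""

    def __init__(self):
        self._table = {}
        self._indexes = itertools.count()

    def register(self, scm):
        index = next(self._indexes)
        self._table[index] = scm   # keeps the SCM reachable
        return index

    def unregister(self, index):
        self._table.pop(index, None)

    def __len__(self):
        return len(self._table)


class ScmProxy:
    """Model of a PyObject wrapping an SCM."""

    def __init__(self, registry, scm):
        self._registry = registry
        self._index = registry.register(scm)

    def __del__(self):
        # Dropping the table entry makes the SCM collectable again.
        self._registry.unregister(self._index)
```

Creating a proxy adds one entry to the registry, and the entry disappears when the proxy is destroyed.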

PyGuile - Part 4 - Argument and result conversion issues

Monday, September 22nd, 2008

There is no 1:1 mapping between Scheme and Python data types. As a consequence, there are several cases in which PyGuile has to guess how the user would like the arguments and result of a Python function to be converted. Instead of guessing, we would like to let the user be explicit about the kind of conversion wanted.

The following is a census of the ambiguous data conversion cases that I identified.

  1. Scheme pair ->
    • Python 2-Tuple
    • Python 2-List
  2. Scheme list ->
    • Python Tuple
    • Python List
    • Nested tree of pairs (2-Tuples or 2-Lists)
  3. Python 2-Tuple or 2-List ->
    • Scheme pair
    • Scheme list
    • Scheme rational (if the Pythonic data structure consists of two integer values)
  4. Scheme alist (association list) ->
    • Python Dict
    • Python Tuple/List of 2-Tuples
  5. Python string ->
    • Scheme string
    • Scheme symbol
    • Scheme keyword

    Additional considerations:

    • Case sensitivity of symbols and keywords
    • The string representation of keywords in Guile has a leading dash - should it be retained or removed on the Python side?
  6. Python 1-character string ->
    • Scheme char
    • Scheme string

    Additional consideration: a single UTF-8 encoded glyph can be a sequence of several bytes, so it is not necessarily a 1-character string.

  7. Python int ->
    • Scheme int
    • Scheme bignum
    • Scheme char
  8. Python None -> One of several possible values: '(), #f, SCM_EOL, '*None* or another custom Scheme value.
  9. Python (),[],{} ->
    • Scheme '()
    • SCM_EOL
    • Custom Scheme value
  10. Scheme '() ->
    • Python ()
    • Python []
    • Python {}
  11. SCM_EOL ->
    • Python (),[],{}
    • Python None
    • Custom Python value
  12. Scheme rational ->
    • Python Float
    • 2-Tuple of Python Ints
  13. Scheme exact/inexact flag in numerical values - if and how to represent it in the Python side of the application?
  14. Giant data structures with sparse access needs - lazy vs. eager conversion
  15. Exception objects
  16. Objects of certain classes (vectors, ports, functions, images, etc.)

There is also the separate issue of string encoding/decoding, which we deal with by mandating that anything passing between Scheme and Python be UTF-8 encoded.

One of the goals of PyGuile is to make it efficient to invoke Python library functions from Guile. Therefore, efficiency of conversion of function arguments and results is critical.

When there are no user hints, the following inefficiencies occur:

  1. PyGuile has to make a default (and possibly sub-optimal) choice when encountering one of the above ambiguous cases. Then the script using the data has to reformat it to match the data format to its actual needs.
  2. PyGuile has to identify the data type of each datum. The present implementation does not go into the internal representation of Guile (SCM) and Python (PyObject) objects, therefore PyGuile has to test for various data types one by one, until one of them matches the argument.
  3. Sometimes a Python procedure needs to do no processing on one of its arguments. The argument’s value needs only to be passed around as a pointer, or to be inserted into the right place in a result data structure. In such a case, it is desirable to use the most efficient conversion possible, i.e. wrapping and unwrapping opaque objects. This is a generalization of the case of giant data structures with sparse access needs.

Therefore, when performance is critical, hints from the user would help not only to disambiguate the conversion process but also to speed it up.
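As an illustration of how a hint removes the guesswork, consider case 5 of the census (Python string). The sketch below is a simplification of my own: it renders the result as Scheme source text rather than building SCM objects, and the function name and hint names are hypothetical.

```python
def py_string_to_scheme(text, hint="string"):
    """Render a Python string as Scheme source text, as directed by hint.

    Assumption of this sketch: hints are the names "string", "symbol"
    and "keyword"; without a hint, "string" is the default choice.
    """
    if hint == "string":
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    if hint == "symbol":
        return text
    if hint == "keyword":
        return "#:" + text   # Guile's read syntax for keywords
    raise ValueError("unknown conversion hint: %r" % (hint,))
```

With the hint given, no probing of alternatives is needed: py_string_to_scheme("width", hint="keyword") goes straight to the keyword branch.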

The user hints will be implemented as follows.
With each function (Python function invoked from Guile, or Guile function invoked from Python) we associate two (possibly degenerate) signatures. One signature will contain the hints for converting the function’s arguments. The second signature will hint how to convert the function’s result. The signatures are Scheme lists, whose leaf nodes are symbols denoting conversion functions.

Chris Jester-Young, in his answer to my question on Stack Overflow, proposed the following function for traversing two corresponding tree structures and applying the functions in one of them to the data in the other.

  (define (map-traversing func data)
    (if (list? func)
        (map map-traversing func data)
        (func data)))

Using it requires unquoting. Example:

  (map-traversing `((,car ,cdr) ,cadr) '(((aa . ab) (bb . bc)) (cc cd . ce)))

Our implementation will differ from the above in details, as the signatures’ leaf nodes do not denote proper Scheme functions.
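The difference can be sketched in Python: the leaves of the signature are names, which are looked up in a table of conversion functions before being applied. This is only an illustration of the traversal. The table contents and the names used in it are hypothetical and are not PyGuile's actual conversion functions.

```python
# Hypothetical table mapping signature leaf names to conversion functions.
CONVERTERS = {
    "int": int,
    "str": str,
    "tuple": tuple,
}


def apply_signature(signature, data):
    """Walk signature and data in parallel, converting at the leaves."""
    if isinstance(signature, list):
        if len(signature) != len(data):
            raise ValueError("signature and data differ in shape")
        return [apply_signature(sub_signature, sub_data)
                for sub_signature, sub_data in zip(signature, data)]
    # Leaf: a name denoting a conversion function.
    return CONVERTERS[signature](data)


converted = apply_signature([["int", "str"], "tuple"], [["42", 7], [1, 2]])
```

Because the leaves are names rather than functions, no unquoting is needed when writing a signature.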

PyGuile - Part 2 - Design Issues

Saturday, September 20th, 2008

While working on PyGuile, I identified the following design issues.

  1. The data type trees of Scheme and Python do not have a 1:1 correspondence.
    • Do we want to convert a Scheme list into a Python Tuple or a Python List?
    • How about an alist (association list) - should it be a Python List of 2-tuples or a Python Dict?
    • And in the other direction - do we want to convert a Python string into a Scheme string, symbol or keyword?
  2. API for adding plugins which convert between Guile and Python representations of useful data types (such as file handles, images or Berkeley sockets).
  3. How do we want to pass large data structures - convert them immediately, or employ lazy conversion (convert an element only when it is requested)? If we employ lazy conversion, how do we implement the associated bookkeeping? See more about this below.
  4. How do we deal with the different garbage collection regimes of Guile and Python? In particular, how do we make SCM objects owned by Python objects known to the Guile garbage collector?
  5. How will we support Unicode? Bear in mind that we want to minimize manipulations of long text strings.
  6. How do we allow each scripting language to seamlessly invoke functions in the other scripting language?

The problem of lack of 1:1 correspondence will be dealt with as follows.

A standard conversion convention, which will work for the overwhelming majority of cases, will be employed. Functions, which have special needs, will have their argument conversions specified by means of a suitable tree-structured template.

When passing a data structure (or object) created in language A to language B, the following cases can happen:

  1. Opaque pointer - B only passes it around. A performs all processing and B just holds the pointer for future reference.
  2. B accesses a single element (or small number of elements) in the data structure.
  3. B loops over all elements of the data structure.
  4. B needs arbitrary access to several elements of the data structure (example: image processing).

Those cases can be dealt with as follows:

  • Case 1 can be handled by wrapping a language A pointer by a language B object, which carries opaque data around.
  • Cases 2 and 3 can be dealt with by means of custom data access procedures (such as Python’s __getitem__()). An element will be converted only when it is actually requested. Elements in nested data structures can be dealt with as in case 1.
  • Case 4 can be handled by implementing a mechanism for plugging in and registering custom conversion functions for specific data types.
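The lazy conversion used for cases 2 and 3 can be sketched as follows. The foreign data structure is simulated here by a Python list and the conversion by an arbitrary function. The class name LazyConvertedSequence is hypothetical, and caching converted elements is my own addition.

```python
class LazyConvertedSequence:
    """Wrap a foreign sequence and convert each element only on request."""

    def __init__(self, foreign_items, convert):
        self._items = foreign_items
        self._convert = convert
        self._cache = {}   # index -> already converted element

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch supports integer indexes only")
        if index < 0:
            index += len(self._items)
        if not 0 <= index < len(self._items):
            raise IndexError(index)
        if index not in self._cache:
            self._cache[index] = self._convert(self._items[index])
        return self._cache[index]
```

Accessing a single element converts just that element (case 2), while a for loop over the wrapper converts elements one by one as the loop reaches them (case 3).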

In practice, the toughest design issue that I have identified so far is the management of the SCM objects owned by Python objects.

When an SCM object is assigned to an attribute of a Python object, some registration mechanism needs to be invoked so that the SCM object can be reclaimed by the Guile garbage collector if the Python object goes out of scope. The registration mechanism also needs to take care of marking the SCM objects while they are owned by a living Python object.

PyGuile - Part 1 - Using Python libraries in Guile (a Scheme implementation) scripts

Friday, September 19th, 2008

For a long time I have dreamt of invoking Python libraries from scripts written in Scheme. The reason for this is to be able to enjoy the fantastically rich control structures possible in Scheme, yet use familiar libraries to accomplish useful actions, some of which are unavailable in SLIB and other Scheme libraries.

Now at last I am working on realizing this dream. The Scheme implementation being used is version 1.6 of Guile and the Guile extension being developed embeds a Python 2.4 interpreter. In the future, more recent versions of Guile and Python will be used.

The goals of the project are:

  1. Make it easy to invoke Python libraries from Guile.
  2. The integration between Python and Guile is to be seamless.
  3. The architecture of the implementation shall enable optimizations for efficient runtime behavior.

To accomplish those goals, it is necessary to:

  1. Convert primitive Scheme data types (integers, reals, Booleans, strings, lists) into the corresponding Python data types, and vice versa.
  2. Be able to invoke functions defined in one language from the other language. This has to be bidirectional in order to support callbacks.
  3. Be able to pass around pointers to objects (as opaque values) and invoke methods over them.
  4. Have efficient transfer of control and data between both languages.
  5. Deal with different garbage collection conventions in both environments.
  6. Be able to optimize code for a particular pair of language runtime systems.
  7. Nice to have: support for recursion, especially tail recursion.
  8. Nice to have: thread-safety.

It is envisioned that the software developed in this project will be part of a larger system, which will allow more scripting languages to interoperate with Guile and with each other.

There is another project - Schemepy - which embeds a Scheme interpreter in Python scripts. That project has a different focus: it essentially allows Scheme to be used for those parts of a project in which its strengths are especially important.

Choosing a Python module for accessing Microsoft SQL Server Unicode data

Tuesday, December 25th, 2007

One day I found myself in need of Python code that retrieves Unicode data from Microsoft SQL Server tables. The code needs to run on a PC with MS-Windows XP.

The dbi and odbc modules, which I used in the past, failed miserably in this task, by forcing the Unicode data to be converted into string data using the ASCII encoder.

So, I had to look for other Python modules. My findings from evaluating the relevant Python modules are summarized below.

dbi,odbc from pywin32
  • Package: pywin32-210.win32-py2.5.exe, available from Python for Windows Extensions.
  • Textual data is passed as strings, rather than as Unicode.
  • Parameters in SQL queries are marked by ‘?’.
  • Dates/times are retrieved as instances of the dbi.dbiDate class (essentially, a wrapped long int).
win32com
I was not successful in using the win32com based code, which worked for Arik Baratz. According to him, this code uses the Microsoft ActiveX Data Objects 2.8 Library. It requires the modified version 209.1 of pywin32, which comes with version 2.5.1.1 of the ActiveState Python distribution. This modified version adds to the win32com class an extra member - client.
You need to add the following line sometime after the import win32com:

win32com.client.gencache.EnsureModule('{DA9159C1-C9D5-4223-81AC-2B1080701D12}',0,1,0)

To actually start working, use win32com.client.Dispatch() to establish a connection to the SQL Server.

pymssql
pyodbc
  • Package: pyodbc-2.0.39.win32-py2.5.exe, available from pyodbc - A Python DB API module for ODBC
  • Textual data is passed as Unicode.
  • Parameters in SQL queries are marked by ‘?’.
  • Dates/times are retrieved as instances of the datetime.datetime class.

The Python module chosen is pyodbc.
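The ‘?’ parameter marker noted above can be illustrated with a short sketch. To keep it runnable without a SQL Server, it uses the standard library’s sqlite3 module as a stand-in for pyodbc. This substitution is my assumption for illustration only: sqlite3 also marks parameters with ‘?’, but the connection call, the table and the data below are made up and do not come from the evaluation described in this post.

```python
import sqlite3


def fetch_names(connection, min_id):
    """Run a parameterized query; the '?' marks the parameter."""
    cursor = connection.cursor()
    cursor.execute("SELECT name FROM people WHERE id >= ? ORDER BY id", (min_id,))
    return [row[0] for row in cursor.fetchall()]


# sqlite3 in-memory database standing in for a SQL Server connection.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (id INTEGER, name TEXT)")
connection.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "Ada"), (2, "שלום"), (3, "Zoë")],
)
names = fetch_names(connection, 2)
```

The textual data comes back as Unicode strings, which is the behaviour I was looking for in the SQL Server case.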