Python | C Strings of Doubtful Encoding | Set-1

Published: April 12, 2022

Strings can be converted between C and Python in both directions, but sometimes the encoding of the C data is doubtful or unknown. Suppose some C data is supposed to be UTF-8, but this is not strictly enforced. Such malformed data must be handled so that it neither crashes Python nor destroys the string data in the process.
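To see why malformed data is a problem, here is a minimal pure-Python sketch. It assumes the same byte sequence as the C example in this article; the variable names are illustrative only. It shows the two failure modes mentioned above: strict decoding raises an exception, and the "replace" error handler silently loses the original byte.

```python
# Same bytes as the C example: valid UTF-8 for "ñ" (c3 b1) followed by a stray 0xae byte.
raw = b"Spicy Jalape\xc3\xb1o\xae"

# Strict decoding (the default) rejects the stray byte outright.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" succeeds, but substitutes U+FFFD for the bad byte,
# so the original byte can no longer be recovered.
replaced = raw.decode("utf-8", errors="replace")
lossy_roundtrip = replaced.encode("utf-8")

print(strict_failed)           # True
print(lossy_roundtrip == raw)  # False
```

Neither behaviour is acceptable when the bytes have to be handed back to C unchanged, which is what motivates the approach used in the rest of the article.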

Code#1 : C data and a function illustrating the problem.
#include <stdio.h>

/* Some dubious string data (malformed UTF-8) */
const char *sdata = "Spicy Jalape\xc3\xb1o\xae";
int slen = 16;

/* Output character data as hex bytes */
void print_chars(const char *s, int len)
{
    int n = 0;
    while (n < len) {
        printf("%2x ", (unsigned char)s[n]);
        n++;
    }
    printf("\n");
}

In the code above, sdata contains a mix of valid UTF-8 (the bytes c3 b1, which encode ñ) and malformed data (the stray byte ae). Nevertheless, calling print_chars(sdata, slen) in C works fine, because C treats the string as plain bytes.

Now suppose one wants to convert the contents of sdata into a Python string and later pass that string back to print_chars() through an extension. The code below does this in a way that exactly preserves the original data despite the encoding problems.

Code#2 :

/* Return the C string back to Python */
static PyObject *py_retstr(PyObject *self, PyObject *args)
{
    if (!PyArg_ParseTuple(args, "")) {
        return NULL;
    }
    /* surrogateescape maps each undecodable byte to a lone surrogate */
    return PyUnicode_Decode(sdata, slen, "utf-8", "surrogateescape");
}

/* Wrapper for the print_chars() function */
static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    PyObject *obj, *bytes;
    char *s = NULL;
    Py_ssize_t len;

    if (!PyArg_ParseTuple(args, "U", &obj)) {
        return NULL;
    }
    /* Encode with the same handler so the surrogates become the original bytes */
    bytes = PyUnicode_AsEncodedString(obj, "utf-8", "surrogateescape");
    if (bytes == NULL) {
        return NULL;
    }
    if (PyBytes_AsStringAndSize(bytes, &s, &len) < 0) {
        Py_DECREF(bytes);
        return NULL;
    }
    print_chars(s, (int)len);
    Py_DECREF(bytes);
    Py_RETURN_NONE;
}

Code#3 : Calling the functions from Code#2 in Python, assuming the extension module exposes them as retstr() and print_chars().

s = retstr()
print(repr(s))
print_chars(s)

Output :

'Spicy Jalapeño\udcae'
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f ae

Here, one can see that the malformed C string was decoded into a Python string without errors, with the stray byte ae represented as the lone surrogate \udcae. When the string was passed back into C, it was encoded into a byte string containing exactly the same bytes as the original C string.
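The same round trip can be reproduced without a C extension, since Python's own bytes.decode() and str.encode() accept the "surrogateescape" error handler. The following is only a sketch: hex_dump() is a hypothetical helper written here to mirror print_chars(), not part of the extension.

```python
raw = b"Spicy Jalape\xc3\xb1o\xae"

# Decode: the undecodable byte 0xae becomes the lone surrogate U+DCAE
# (the handler maps a bad byte b to the code point 0xDC00 + b).
text = raw.decode("utf-8", errors="surrogateescape")

# Encode with the same handler: the surrogate turns back into the byte 0xae.
restored = text.encode("utf-8", errors="surrogateescape")


def hex_dump(data: bytes) -> str:
    """Hypothetical helper mirroring print_chars(): one hex pair per byte."""
    return " ".join(f"{b:02x}" for b in data)


print(ascii(text))         # 'Spicy Jalape\xf1o\udcae'
print(hex_dump(restored))  # 53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f ae
print(restored == raw)     # True
```

The sketch prints ascii(text) rather than text itself because a string containing a lone surrogate cannot be encoded by a strict UTF-8 output stream, so printing it directly may raise UnicodeEncodeError.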

Article Contributed By :
manikachandna97