Convert XML UTF-16 to JSON UTF-8 with PHP cURL

Previously titled
Fix PHP cURL: parser error: Document labelled UTF-16 but has UTF-8 content

What?
These are my notes on how to convert XML received in UTF-16 into JSON in UTF-8. If the XML were entirely in UTF-8, I would simply load it with SimpleXML and pass it to PHP's built-in json_encode function. Instead, I ran into the following errors:
Warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###
Why?
So I've searched Google, Bing and Yahoo for this, and although there are solutions that deal with loading UTF-16 content into SimpleXMLElement or simplexml_load_string, they don't solve my problem. I'm receiving XML data within a cURL result, but I get the above error when using either "SimpleXMLElement" or "simplexml_load_string". Returning the XML with cURL isn't a problem, but I want to convert it to JSON, and I usually load the data into an XML object and pass it to the built-in PHP function "json_encode".
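
Since the warning says the declaration and the content disagree, it can help to look at what the bytes in the response actually are before choosing a fix. The following is a minimal sketch, assuming the response is already held in a string; the function name inspect_xml_bytes and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: report what the XML declaration claims versus what the bytes look like.
function inspect_xml_bytes(string $bytes): array
{
    // A UTF-16 byte order mark is FF FE (little-endian) or FE FF (big-endian).
    $bom = 'none';
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $bom = 'UTF-16LE';
    } elseif (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        $bom = 'UTF-16BE';
    } elseif (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        $bom = 'UTF-8';
    }

    // Encoding named in the XML declaration, if it can be read as plain ASCII.
    $declared = null;
    if (preg_match('/^[^<]*<\?xml[^>]*encoding=["\']([^"\']+)["\']/i', $bytes, $m)) {
        $declared = strtolower($m[1]);
    }

    return array(
        'bom'        => $bom,
        'declared'   => $declared,
        // UTF-16 text made of ASCII characters contains NUL bytes; UTF-8 XML does not.
        'has_nul'    => strpos($bytes, "\0") !== false,
        'valid_utf8' => mb_check_encoding($bytes, 'UTF-8'),
    );
}

// Sample data (an assumption): UTF-8 bytes with a UTF-16 label.
$sample = '<?xml version="1.0" encoding="utf-16"?><Response Version="1.0"/>';
print_r(inspect_xml_bytes($sample));
```

If "declared" says utf-16 while "has_nul" is false and "valid_utf8" is true, the payload is most likely UTF-8 carrying a misleading declaration, which is the situation the parser warning describes.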

How?
So here's what I tried and ended up with:

If your XML is UTF-8
This is the basic code and will work to fetch some XML and return it in JSON formatting as long as the XML is encoded in UTF-8.
// set headers for JSON file
// header('Content-Type: application/json'); // seems to cause 500 Internal Server Error
header('Content-Type: text/javascript');
header('Access-Control-Allow-Origin: http://api.joellipman.com/');
header('Access-Control-Max-Age: 3628800');
header('Access-Control-Allow-Methods: GET, POST, PUT, DELETE');

// open connection
$ch = curl_init();

// set the cURL options
curl_setopt($ch, CURLOPT_URL, $api_url);                                // where to send the variables to
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));  // specify content type of what we're sending
curl_setopt($ch, CURLOPT_HEADER, 0);                                    // hide header info !!SECURITY WARNING!!
curl_setopt($ch, CURLOPT_POST, TRUE);                                   // TRUE to do a regular HTTP POST.
curl_setopt($ch, CURLOPT_POSTFIELDS, $api_message_xml);                 // In my case, the XML form that will be submitted
curl_setopt($ch, CURLOPT_TIMEOUT, 15);                                  // Target API has a 15 second timeout
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);                         // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.

// store the response
$ch_result = curl_exec($ch);

// close connection
curl_close($ch);

// convert the response to xml
$xml_result = simplexml_load_string($ch_result) or die("Error: Cannot create object");

// convert the xml to json
$json_result = json_encode($xml_result);

// print the json
echo $json_result;

// [OPTIONAL] convert it to an array
// $array = json_decode($json_result,TRUE);

// yields <?xml version="1.0" encoding="utf-8"?> ... ... ...
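
The "or die" in the listing above stops the script without saying what went wrong. One possible alternative, sketched here under the assumption that the XML is already in a string, is to let libxml collect its messages and return them to the caller. The helper name xml_string_to_json and the sample XML are hypothetical.

```php
<?php
// Hypothetical helper: parse an XML string and return JSON, or null plus the parser messages.
function xml_string_to_json(string $xml, array &$errors = array()): ?string
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $element = simplexml_load_string($xml);
    foreach (libxml_get_errors() as $error) {
        $errors[] = trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    if ($element === false) {
        return null; // parsing failed; $errors explains why
    }
    $json = json_encode($element);
    return $json === false ? null : $json;
}

$errors = array();
echo xml_string_to_json('<Response Version="1.0"><Status>OK</Status></Response>', $errors), "\n";
```

When the function returns null, the $errors array holds the same "parser error" text that would otherwise appear as warnings, so it can be logged or returned in the JSON feed.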

Without cURL
You'll have seen this all over the Internet as the accepted solution. It doesn't work for me because I'm using cURL, but it's a first point of reference. It will work if the received XML is already a PHP string.
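// setting XML value
$string = '<?xml version="1.0" encoding="utf-16"?>
  <Response Version="1.0">
    <DateTime>2/13/2013 10:37:24 PM</DateTime>
  </Response>';
// relabel the declaration as UTF-8, then load the string
$xml = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $string);
$xml_result = simplexml_load_string($xml);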

With cURL: Other things I tried
ERROR: Using the above preg_replace function
/* Replace UTF-16 with UTF-8 */
$xml_utf8 = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $ch_result);

$xml_result = simplexml_load_string($xml_utf8);
// yields error 'Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###'
// to catch error use: $xml_result = simplexml_load_string($ch_result) or die("Error: Cannot create object");

ERROR: Using built-in function mb_convert_encoding
/* Convert the UTF-16 to UTF-8: Using function mb_convert_encoding */
$xml_utf8 = mb_convert_encoding($ch_result, 'UTF-8', 'UTF-16');

// yields error 'parser error : Start tag expected, '<' not found in /public_html/.../.../my_script.php on line ###'

ERROR: Using built-in function utf8_encode
/* Convert the UTF-16 to UTF-8 using a function */
$xml_utf8 = utf8_encode($ch_result);

// yields error 'Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###'

ERROR: A potential function to re-encode it from Craig Lotter
/* Convert the UTF-16 to UTF-8 using a function */
$xml_utf8 = utf16_to_utf8($ch_result);

// yields error 'parser error : Start tag expected, '<' not found in /public_html/.../.../my_script.php on line ###'
// also yields: ??? 呭㤳䥆汶摓䉄套㑧唲噬䥅ㅬ䥑㴽

ERRORS: A 2-Hour play around
/* Encode received cURL result in a JSON feed */
$json_encoded_str = json_encode($ch_result);

/* Convert the UTF-16 to UTF-8 using a function */
$json_encoded_str_8 = (string) utf8_encode($json_encoded_str);

/* In the XML, replace the UTF-16 with UTF-8 */
$json_encoded_str = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $json_encoded_str_8);  

/* Encode the relabelled string as JSON again */
$json_encoded = json_encode($json_encoded_str);

// yields escaped JSON: "<?xml version=\"1.0\" encoding=\"utf-16\"?><soap:Envelope

ERROR: Using built-in function iconv. Another 4-hour saga
/* Convert the UTF-16 to UTF-8 using a function */
$xml_utf8 = iconv('UTF-16', 'UTF-8', $ch_result);
// $xml_utf8 = iconv('UTF-16BE', 'UTF-8', $ch_result); // same result specifying Big-Endian

// yields error 'error on line 1 at column 1: Document is empty'
// but view the source: 㼼浸敶獲潩㵮ㄢ〮•湥潣楤杮∽瑵ⵦ㘱㼢㰾潳灡䔺癮汥灯⁥浸湬㩳獸㵩栢瑴㩰⼯睷⹷㍷漮杲㈯

// OTHER ERRORS:
// error on line 1 at column 1: Document is empty
// error on line 2 at column 1: Extra content at the end of the document
// error on line 2 at column 1: Encoding error

// error on line 1 at column 491: xmlParseEntityRef: no name
// this is because you need to escape the 5 characters (", ', <, >, &) in XML
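
The CJK characters in the iconv attempt look like what plain ASCII bytes become when they are read two at a time as UTF-16 little-endian: the bytes of "<?" (3C 3F) form the single code point U+3F3C. That suggests, though it does not prove, that the response was UTF-8 all along and only labelled UTF-16. A minimal sketch that converts only when the bytes really look like UTF-16 and then fixes the label; the function name normalize_xml_to_utf8 and the sample document are assumptions made for illustration.

```php
<?php
// Hypothetical helper: return UTF-8 XML whatever the response really contains.
function normalize_xml_to_utf8(string $bytes): string
{
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        // Genuine UTF-16 little-endian: drop the BOM and convert.
        $bytes = mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    } elseif (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        // Genuine UTF-16 big-endian: drop the BOM and convert.
        $bytes = mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    } elseif (strpos($bytes, "\0") !== false) {
        // No BOM but NUL bytes: assume BOM-less UTF-16LE (an assumption, not a certainty).
        $bytes = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');
    }
    // Otherwise the bytes are treated as UTF-8 already and left alone.

    // Make the declaration agree with the content.
    return preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $bytes, 1);
}

$declared16 = '<?xml version="1.0" encoding="utf-16"?><Response><Status>OK</Status></Response>';

// Case 1: UTF-8 bytes mislabelled as UTF-16.
$fixed = normalize_xml_to_utf8($declared16);

// Case 2: the same document genuinely encoded as UTF-16LE with a BOM.
$real16 = "\xFF\xFE" . mb_convert_encoding($declared16, 'UTF-16LE', 'UTF-8');
$fixedReal = normalize_xml_to_utf8($real16);

// Converting bytes that are not UTF-16 reproduces the noise seen in the iconv attempt.
$noise = mb_convert_encoding('<?xm', 'UTF-8', 'UTF-16LE');

echo $fixed, "\n";
```

Both cases end up as the same UTF-8 string with a UTF-8 declaration, which simplexml_load_string can then read.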

NOT-QUITE-RIGHT: Use a Parser and re-Output the XML
// Create an XML parser
$parser = xml_parser_create();

// Stop returning elements in UPPERCASE
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// Parse XML data into an array structure
xml_parse_into_struct($parser, str_replace(array("\n", "\r", "\t"), '', $ch_result), $structure);

// Free the XML parser
xml_parser_free($parser);

// create XML string from parsed XML
$xml_string = '';
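// '&' must be replaced first, otherwise the '&' of the entities inserted earlier gets escaped again
$xml_escaped_chars = array('&', '"', '\'', '<', '>');
$xml_escaped_chars_rep = array('&amp;', '&quot;', '&apos;', '&lt;', '&gt;');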

foreach($structure as $xml_element){

        $this_value = (isset($xml_element['value'])) ? str_replace($xml_escaped_chars, $xml_escaped_chars_rep, trim($xml_element['value'])) : '';
        $this_attr = (isset($xml_element['attributes'])) ? $xml_element['attributes'] : array();
        $this_attr_str = '';
        if (count($this_attr)>0){
                foreach($this_attr as $attr_key => $attr_value){
                        $this_attr_str.= ' '.$attr_key.'="'.$attr_value.'"';
                }
        }
        if ($xml_element['type']=='open'){
                $xml_string.='<'.$xml_element['tag'].$this_attr_str.'>';
        } else if ($xml_element['type']=='complete'){
                $xml_string.='<'.$xml_element['tag'].$this_attr_str.'>'.$this_value.'</'.$xml_element['tag'].'>';
        } else if ($xml_element['type']=='close'){
                $xml_string.='</'.$xml_element['tag'].'>';
        }
}
// $simple_xml = simplexml_load_string($xml_string);  // still fails (not UTF-8)
 echo '<?xml version="1.0" encoding="utf-8"?>'.utf8_encode($xml_string);

// yields <?xml version="1.0" encoding="utf-8"?> ... ... ... (corrupted?)
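
For escaping the five special characters, PHP's built-in htmlspecialchars can do the job in one pass, so an ampersand is never escaped twice. A small sketch; the wrapper name xml_escape is made up here.

```php
<?php
// Hypothetical wrapper: escape text for use in XML element content or attribute values.
function xml_escape(string $text): string
{
    // ENT_XML1 selects XML entity names; ENT_QUOTES also escapes single quotes.
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

echo xml_escape('Tom & "Jerry" <friends>'), "\n";
// prints: Tom &amp; &quot;Jerry&quot; &lt;friends&gt;
```

The same call could also be applied to attribute values when rebuilding the tags in a loop like the one above.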

So...

With cURL - a solution with a compromise
After many more hours, here is a solution that takes XML labelled as UTF-16 from a cURL source and converts it to JSON. The output isn't necessarily in UTF-8, so I'll update this article if the mobile app has problems reading the JSON feed. While writing the loop for the "not-quite-right" solution above, I found the following function in a discussion thread titled "Integrating symphony website with external api [whmcs]":
// set headers for JSON file
header('Content-Type: text/javascript; charset=utf-8');
header('Access-Control-Allow-Origin: http://api.joellipman.com/');
header('Access-Control-Max-Age: 3628800');
header('Access-Control-Allow-Methods: GET, POST, PUT, DELETE');

// the function that will convert our XML to an array
function whmcsapi_xml_parser($rawxml) {
    $xml_parser = xml_parser_create();
    xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0);     // stop elements being converted to UPPERCASE
    xml_parse_into_struct($xml_parser, $rawxml, $vals, $index);
    xml_parser_free($xml_parser);
    $params = array();
    $level = array();
    $alreadyused = array();
    $x=0;
    foreach ($vals as $xml_elem) {
      if ($xml_elem['type'] == 'open') {
         if (in_array($xml_elem['tag'],$alreadyused)) {
            $x++;
            $xml_elem['tag'] = $xml_elem['tag'].$x;
         }
         $level[$xml_elem['level']] = $xml_elem['tag'];
         $alreadyused[] = $xml_elem['tag'];
      }
      if ($xml_elem['type'] == 'complete') {
       $start_level = 1;
       $php_stmt = '$params';
       while($start_level < $xml_elem['level']) {
         $php_stmt .= '[$level['.$start_level.']]';
         $start_level++;
       }
       $php_stmt .= '[$xml_elem[\'tag\']] = $xml_elem[\'value\'];';
       @eval($php_stmt);
      }
    }
    return($params);
}

// open connection
$ch = curl_init();

// set the cURL options
curl_setopt($ch, CURLOPT_URL, $api_url);                                // where to send the variables to
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));  // specify content type of what we're sending
curl_setopt($ch, CURLOPT_HEADER, 0);                                    // hide header info !!SECURITY WARNING!!
curl_setopt($ch, CURLOPT_POST, TRUE);                                   // TRUE to do a regular HTTP POST.
curl_setopt($ch, CURLOPT_POSTFIELDS, $api_message_xml);                 // In my case, the XML form that will be submitted
curl_setopt($ch, CURLOPT_TIMEOUT, 15);                                  // Target API has a 15 second timeout
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);                         // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.

// store the response
$ch_result = curl_exec($ch);

// close connection
curl_close($ch);

// parse XML with the whmcsapi_xml_parser function
$whmcsapi_arr = whmcsapi_xml_parser($ch_result); 

// Output returned value as Array
// print_r($whmcsapi_arr); 

// Encode in JSON
$json_whmcsapi = json_encode((array) $whmcsapi_arr);
echo $json_whmcsapi;
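
The @eval call in the function above builds a nested array assignment as a string and hides any error it causes. The same nesting can be written with a reference that walks down the levels. This is a sketch of an alternative, not the thread's own code: the function name xml_struct_to_array and the sample XML are made up, and it assumes the same xml_parse_into_struct output as above.

```php
<?php
// Hypothetical eval-free variant: turn XML into a nested array keyed by tag name.
function xml_struct_to_array(string $rawxml): array
{
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep tag names as written
    xml_parse_into_struct($parser, $rawxml, $vals);

    $params = array();
    $level = array();       // tag name of the open element at each depth
    $alreadyused = array(); // open tags seen so far, to number repeats
    $x = 0;
    foreach ($vals as $elem) {
        if ($elem['type'] === 'open') {
            $tag = $elem['tag'];
            if (in_array($tag, $alreadyused, true)) {
                $x++;
                $tag .= $x; // repeated containers get a numeric suffix, as in the original
            }
            $level[$elem['level']] = $tag;
            $alreadyused[] = $tag;
        }
        if ($elem['type'] === 'complete') {
            // Walk down from the root to the parent of this element.
            $node = &$params;
            for ($i = 1; $i < $elem['level']; $i++) {
                if (!isset($node[$level[$i]]) || !is_array($node[$level[$i]])) {
                    $node[$level[$i]] = array();
                }
                $node = &$node[$level[$i]];
            }
            // Empty elements carry no 'value' key; store an empty string for them.
            $node[$elem['tag']] = $elem['value'] ?? '';
            unset($node); // break the reference before the next iteration
        }
    }
    return $params;
}

$sample = '<Response><DateTime>x</DateTime><Item><Name>a</Name></Item></Response>';
echo json_encode(xml_struct_to_array($sample)), "\n";
```

As with the original function, repeated leaf elements with the same tag overwrite one another, so this suits responses where each tag appears once per container.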


Off-Topic
But a good cURL snippet by David Walsh
// set POST variables
$url = 'http://domain.com/get-post.php';
$fields = array(
        'lname' => urlencode($last_name),
        'fname' => urlencode($first_name),
        'title' => urlencode($title),
        'company' => urlencode($institution),
        'age' => urlencode($age),
        'email' => urlencode($email),
        'phone' => urlencode($phone)
);
// url-ify the data for the POST
$fields_string = '';
foreach($fields as $key=>$value) { $fields_string .= $key.'='.$value.'&'; }
$fields_string = rtrim($fields_string, '&'); // rtrim returns the trimmed string, so keep the result

// open connection
$ch = curl_init();

// set the url, number of POST vars, POST data
curl_setopt($ch,CURLOPT_URL, $url);
curl_setopt($ch,CURLOPT_POST, count($fields));
curl_setopt($ch,CURLOPT_POSTFIELDS, $fields_string);

// execute post
$result = curl_exec($ch);

// close connection
curl_close($ch);
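
The url-ify loop in the snippet can also be left to PHP's built-in http_build_query, which encodes each value and joins the pairs. A short sketch with made-up sample values standing in for the form variables:

```php
<?php
// Sample values (assumptions) standing in for the form variables used above.
$fields = array(
    'lname' => "O'Neil",
    'fname' => 'Mary Ann',
    'email' => 'mary@example.com',
);

// http_build_query() URL-encodes each value and joins the pairs with '&'.
$fields_string = http_build_query($fields);

echo $fields_string, "\n";
```

The resulting string can be passed to CURLOPT_POSTFIELDS in the same way as the hand-built one.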

Things I stumbled upon regarding SSL and cURL
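Third-party apps often require data to be posted over SSL, so this may come in handy. Note that the first two options switch certificate verification off, which is only suitable for testing.
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);

// TRUE to output SSL certification information to STDERR on secure transfers.
curl_setopt($ch, CURLOPT_CERTINFO, TRUE);

// CURL_SSLVERSION_SSLv3 is a value for the CURLOPT_SSLVERSION option, not an option itself
// (SSLv3 is obsolete and refused by most servers)
curl_setopt($ch, CURLOPT_SSLVERSION, CURL_SSLVERSION_SSLv3);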
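
For anything beyond testing, it is probably safer to leave the certificate checks on. A minimal sketch of such a configuration; the URL is a placeholder, and the request itself is commented out so that nothing is sent.

```php
<?php
// Options that keep certificate checks on; the URL is a placeholder (an assumption).
$secure_options = array(
    CURLOPT_URL            => 'https://example.com/api',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_SSL_VERIFYPEER => true, // verify the certificate chain
    CURLOPT_SSL_VERIFYHOST => 2,    // verify the certificate matches the host name
    CURLOPT_TIMEOUT        => 15,
);

$ch = curl_init();
// curl_setopt_array() returns true when every option was accepted.
$applied = curl_setopt_array($ch, $secure_options);
// $result = curl_exec($ch); // left commented out so the sketch makes no request
```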

Future Considerations
The data still hasn't been properly decoded from UTF-16 and encoded to UTF-8
  • Test writing to a file, re-encoding the file then reading from it.

Well, this is my stop. It's been several hours on something that could have taken others a few minutes if they knew where to look. My aim was to convert XML received in UTF-16 to UTF-8 so that it could be converted to JSON, and that has been achieved in part. It's 6am and I'm off to bed.


© 2021 Joel Lipman .com. All Rights Reserved.