Thursday, January 29, 2015

Understanding a binary file

Most DVB-T receivers and TVs keep their channel list in a file called dtv_channel.txt. Of course there are tools to manipulate the content of this file, but most of them are for Windows, and I just need to export the list so I can show it in my custom web remote controller.
We begin with hexdump. To understand the output better, we print each byte as a human-readable ASCII character (or control character name) and add an underscore after each one to separate them, using the following command:

hexdump -e '"%_u\_"' dtv_channel.txt

syn_nul_so_nul_lf_nul_* 
O_c_k_o_ _G_o_l_d_nul_* 
88_fe_85_dle_h_b4_85_dle_80_V___._nul_* 
bs_nul_* 
cb_ _stx_etx_stx_eot_nul_* 
soh_nul_* 
dle_nul_ht_nul_* 
O_c_k_o_ _G_o_l_d_nul_* 
soh_etb_soh_nul_* 
soh_etb_stx_nul_* 
soh_nul_* 
stx_etb_etx_nul_e_z_c_nul_* 
ff_nul_* 
ff_nul_* 
etb_nul_si_nul_ht_nul_* 
S_l_a_g_r_ _T_V_nul_* 
dle_stx_86_dle_nul_fb_85_dle_80_V___._nul_* 
bs_nul_* 
cb_ _stx_etx_soh_syn_nul_* 
soh_nul_* 
dle_nul_bs_nul_* 
S_l_a_g_r_ _T_V_nul_* 
soh_ht_soh_nul_* 
soh_ht_stx_nul_* 
soh_nul_* 
stx_ht_etx_nul_e_z_c_nul_* 
ff_nul_* 
ff_nul_* 
can_nul_dle_nul_bel_nul_* 
A_C_T_I_V_E_nul_* 
88_fe_85_dle_80_V___._nul_* 
bs_nul_* 
cb_ _stx_etx_soh_fs_nul_* 
soh_nul_* 
dle_nul_ack_nul_* 
A_C_T_I_V_E_nul_* 
soh_nak_soh_nul_* 
soh_nak_stx_nul_* 
soh_nul_* 
stx_nak_etx_nul_e_z_c_nul_* 
ff_nul_* 
ff_nul_*
After a closer look we can see that there is a clear record separator (so we can divide the data into rows):
ff_nul_* 
ff_nul_*
Another thing we can see is that each "line" ends with nul_*. The * is how hexdump marks repeated output, so in reality there are more nuls. Some file formats use fixed-length fields, and if the data is shorter the rest is filled with nuls, which is the case here.

So let's assume that nul_* is the field separator. There are also lines containing only bs_nul_* and soh_nul_*, and they repeat in a pattern, so most probably they are some kind of separator as well.
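
To make the padding idea concrete, here is a small sketch that builds a fixed-width, nul-padded field and strips the padding again. The helper names pad_field and strip_nuls are made up for this illustration, and the 32-byte width is only an example value.

```sh
#!/bin/sh
# pad_field TEXT WIDTH: print TEXT followed by NUL bytes up to WIDTH bytes.
# (Hypothetical helper; assumes TEXT is plain ASCII and not longer than WIDTH.)
pad_field() {
  printf '%s' "$1"
  pad=$(( $2 - ${#1} ))
  if [ "$pad" -gt 0 ]; then
    dd if=/dev/zero bs=1 count="$pad" 2>/dev/null
  fi
}

# strip_nuls: remove every NUL byte from stdin, leaving only the real data.
strip_nuls() {
  tr -d '\000'
}

# A 32-byte field holding "Slagr TV": 8 data bytes plus 24 NUL bytes.
pad_field 'Slagr TV' 32 | wc -c          # size of the padded field
pad_field 'Slagr TV' 32 | strip_nuls     # the data without the padding
echo
```

A reader of such a file has to know the field width in advance, because the padding itself carries no length information.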

Now let's use a couple of seds to get a more readable format that we can analyze further.
hexdump -e '"%_u\_"' dtv_channel.txt | sed 's/^ff_nul_\*$/;;/g' | sed 's/nul_\*/,/g' | sed 's/^soh_,$//g' | sed 's/^bs_,$//g' | while read line; do echo -n $line; done | sed 's/;;/;\n/g' | sed '/^;$/d'

Let me explain for those not familiar with sed:
#substitute nul_* with , (comma); we will use the comma as field separator
sed 's/nul_\*/,/g'
#remove lines which contain only soh_, and bs_,
sed 's/^soh_,$//g' | sed 's/^bs_,$//g'
#now let's get all the data onto one single line
while read line; do echo -n $line; done
#and split it back into lines, adding ; (semicolon) as a clear record separator. I used just one ff_ because I found that records are not always separated by two ff_ (not sure why)
sed 's/,ff_,/;\n/g'
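
The effect of the first steps is easier to see on a tiny sample. The sketch below feeds four lines copied from the dump above through the same substitutions. The function name to_fields is made up for this example, and it deletes the separator-only lines instead of blanking them, which gives the same result once everything is joined.

```sh
#!/bin/sh
# to_fields: turn "nul_*" runs into commas and drop lines that
# consist only of the soh/bs markers.
to_fields() {
  sed -e 's/nul_\*/,/g' -e '/^soh_,$/d' -e '/^bs_,$/d'
}

printf '%s\n' \
  'A_C_T_I_V_E_nul_*' \
  'bs_nul_*' \
  'cb_ _stx_etx_soh_fs_nul_*' \
  'soh_nul_*' |
  to_fields |
  tr -d '\n'      # join everything into one line, much like the while/echo loop
echo
# prints: A_C_T_I_V_E_,cb_ _stx_etx_soh_fs_,
```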

We can remove the following from the beginning, as it's the file header: ack_,ff_em_,

This is an example of the output:
soh_nul_soh_nul_bs_,C_T_ _1_ _J_M_,f8_c7_85_dle_,80_:_dcl_ _,cb_ _dc2_soh_*,dle_nul_bel_,C_T_ _1_ _J_M_,soh_*,soh_*stx_,stx_,dcl_soh_etx_nul_e_z_c_nul_dc3_soh_etx_,;
!_soh_*,e_z_c_,;
enq_nul_soh_nul_dle_,C_R_o_ _R_A_D_I_O_Z_U_R_N_A_L_,`_8d_85_dle_b8_83_85_dle_80_:_dcl_ _,cb_ _dc2_soh_*A_,stx_,_nul_si_,C_R_o_ _R_A_D_I_O_Z_U_R_N_A_L_,dcl_dle_,dcl_dle_etx_nul_e_z_c_,;
stx_nul_stx_nul_enq_,C_T_ _2_,80_cb_85_dle_f8_c7_85_dle_80_:_dcl_ _,cb_ _dc2_soh_stx_soh_,dle_nul_eot_,C_T_ _2_,soh_stx_soh_,soh_stx_*,stx_,dcl_stx_etx_nul_e_z_c_nul_dc3_stx_etx_,;
!_stx_soh_*,e_z_c_,;
ack_nul_stx_nul_vt_,C_R_o_ _D_V_O_J_K_A_,p_d4_85_dle_`_8d_85_dle_80_:_dcl_ _,cb_ _dc2_soh_stx_A_,stx_,_nul_lf_,C_R_o_ _D_V_O_J_K_A_,dcl_*,dcl_*etx_nul_e_z_c_,;
etx_nul_etx_nul_ack_,C_T_ _2_4_,bs_cf_85_dle_80_cb_85_dle_80_:_dcl_ _,cb_ _dc2_soh_etx_soh_,dle_nul_enq_,C_T_ _2_4_,soh_etx_soh_,soh_etx_stx_,stx_,dcl_etx_*nul_e_z_c_nul_dc3_etx_*,;
!_etx_soh_*,e_z_c_,;

We can see some shorter lines like !_stx_soh_*,e_z_c_,; and if we take a closer look we will find that nul_* is not always a field separator (nuls can be a valid part of the data).
Nevertheless, we can remove those shorter lines for the moment, or move them manually.

So now we have "nice" output that we can open in any spreadsheet program and split by comma. We see that the 2nd column contains the name of the channel. If we look more closely at the first column, we can see that nul_ separates three values.

soh_nul_soh_nul_bs_
enq_nul_soh_nul_dle_
stx_nul_stx_nul_enq_
ack_nul_stx_nul_vt_
etx_nul_etx_nul_ack_
bel_nul_etx_nul_vt_
eot_nul_eot_nul_ht_
bs_nul_eot_nul_si_
ht_nul_enq_nul_lf_
cr_nul_enq_nul_ack_
lf_nul_ack_nul_dcl_
so_nul_ack_nul_ff_
The 1st value is 1,5,2,6,3,7,4,8,9,13... Since the values are unique, I guess this might be an index.
The 2nd values are sequential but appear twice (1,1,2,2,3,3...); that's because TVs and radios are mixed together.
The 3rd values repeat in no obvious order: 8,16,5,11,6,11,... This is slightly more complicated, but it's actually the length of the channel name + 1.
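
Since each of the three values is followed by a nul, a reasonable guess is that they are 16-bit little-endian integers. Below is a minimal sketch for decoding them with od. The helper name le16 and the sample file are invented for the example, and the little-endian interpretation is an assumption based on the dumps, not on any format documentation.

```sh
#!/bin/sh
# le16 FILE OFFSET: print the unsigned 16-bit little-endian integer
# stored at byte OFFSET of FILE.
le16() {
  # od prints the two bytes as decimal numbers, e.g. "5 0"
  set -- $(od -An -v -t u1 -j "$2" -N 2 "$1")
  echo $(( $1 + 256 * $2 ))
}

# Sample bytes 05 00 01 00 10 00 (enq_nul_soh_nul_dle_ in the list above)
sample=$(mktemp)
printf '\005\000\001\000\020\000' > "$sample"

echo "index:          $(le16 "$sample" 0)"
echo "channel number: $(le16 "$sample" 2)"
echo "name length+1:  $(le16 "$sample" 4)"
rm -f "$sample"
```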

As already mentioned, the file uses fixed-length fields and records. We already have some knowledge of the data, so let's try to find the length of a record, and hopefully also of the fields. To get that, let's use the following command:

hexdump -C dtv_channel.txt

Let's highlight a few things:
Beginning of record (address 4da0)
Index, Channel Number, Length of channel name (17 00 0f 00 09 00)
Channel Name (Slagr TV)

Address   Data in HEX                                        Human readable data
00004da0  00 00 00 00 00 00 00 00  17 00 0f 00 09 00 00 00  |................|
00004db0  53 6c 61 67 72 20 54 56  00 00 00 00 00 00 00 00  |Slagr TV........|
00004dc0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00004dd0  10 02 86 10 00 fb 85 10  80 56 5f 2e 00 00 00 00  |.........V_.....|
00004de0  08 00 00 00 cb 20 02 03  01 16 00 00 00 00 00 00  |..... ..........|
00004df0  01 00 00 00 00 00 10 00  08 00 00 00 53 6c 61 67  |............Slag|
00004e00  72 20 54 56 00 00 00 00  00 00 00 00 00 00 00 00  |r TV............|
00004e10  00 00 00 00 00 00 00 00  00 00 00 00 01 09 01 00  |................|
00004e20  00 00 00 00 01 09 02 00  00 00 00 00 00 00 00 00  |................|
00004e30  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004e80  00 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
00004e90  02 09 03 00 65 7a 63 00  00 00 00 00 00 00 00 00  |....ezc.........|
00004ea0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004ed0  00 ff 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00004ee0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000050b0  00 ff 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000050c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00005100  00 00 00 00 00 00 00 00  18 00 10 00 07 00 00 00  |................|
00005110  41 43 54 49 56 45 00 00  00 00 00 00 00 00 00 00  |ACTIVE..........|
00005120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00005130  00 00 00 00 88 fe 85 10  80 56 5f 2e 00 00 00 00  |.........V_.....|
00005140  08 00 00 00 cb 20 02 03  01 1c 00 00 00 00 00 00  |..... ..........|
00005150  01 00 00 00 00 00 10 00  06 00 00 00 41 43 54 49  |............ACTI|
00005160  56 45 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |VE..............|
00005170  00 00 00 00 00 00 00 00  00 00 00 00 01 15 01 00  |................|
00005180  00 00 00 00 01 15 02 00  00 00 00 00 00 00 00 00  |................|
00005190  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000051e0  00 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
000051f0  02 15 03 00 65 7a 63 00  00 00 00 00 00 00 00 00  |....ezc.........|
00005200  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00005230  00 ff 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00005240  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00005410  00 ff 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00005420  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00005460  00 00 00 00 00 00 00 00  47 0e f9 12              |........G...|

By subtracting the addresses of two consecutive record beginnings
4da0 (19872 dec)
5100 (20736 dec)
we get the record length = 864 bytes.
The first 8 bytes of each record can be dropped.
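
The same arithmetic can be left to the shell, which understands hex constants. This is just a sketch of the calculation above. The second check assumes that records are packed back to back from the very start of the file, which the dump suggests but does not prove.

```sh
#!/bin/sh
# Offsets of two consecutive record starts, taken from the dump above.
first=$(( 0x4da0 ))
second=$(( 0x5100 ))

rec_len=$(( second - first ))
echo "record length: $rec_len bytes"

# If records are packed back to back from the start of the file,
# every record start must be a multiple of the record length.
echo "remainder of first offset: $(( first % rec_len ))"
echo "records before it:         $(( first / rec_len ))"
```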

Now, with this knowledge, we can split the file into records: we can use split -b 864 and we will have one record per file. Then we can use the following to look further into the structure of the records:

for f in x*; do echo "$f"; hd "$f"; done

xat
00000000  00 00 00 00 00 00 00 00  13 00 0b 00 0b 00 00 00  |................|
00000010  50 72 69 6d 61 20 4c 4f  56 45 00 00 00 00 00 00  |Prima LOVE......|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  e0 b0 85 10 d0 a9 85 10  80 56 5f 2e 00 00 00 00  |.........V_.....|
00000040  08 00 00 00 cb 20 02 03  04 03 00 00 00 00 00 00  |..... ..........|
00000050  01 00 00 00 00 00 10 00  0a 00 00 00 50 72 69 6d  |............Prim|
00000060  61 20 4c 4f 56 45 00 00  00 00 00 00 00 00 00 00  |a LOVE..........|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 01 02 01 00  |................|
00000080  00 00 00 00 01 02 02 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000e0  00 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
000000f0  02 02 03 00 65 7a 63 00  00 00 00 00 00 00 00 00  |....ezc.........|
00000100  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000130  00 ff 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000140  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000002b0  00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00  |................|
000002c0  00 00 00 00 00 00 00 00  03 02 01 01 00 00 00 00  |................|
000002d0  65 7a 63 00 00 00 00 00  00 00 00 00 00 00 00 00  |ezc.............|
000002e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000310  00 ff 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000320  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000360
xau
00000000  00 00 00 00 00 00 00 00  14 00 0c 00 0b 00 00 00  |................|
00000010  50 72 69 6d 61 20 5a 4f  4f 4d 00 00 00 00 00 00  |Prima ZOOM......|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  68 b4 85 10 58 ad 85 10  80 56 5f 2e 00 00 00 00  |h...X....V_.....|
00000040  08 00 00 00 cb 20 02 03  06 03 00 00 00 00 00 00  |..... ..........|
00000050  01 00 00 00 00 00 10 00  0a 00 00 00 50 72 69 6d  |............Prim|
00000060  61 20 5a 4f 4f 4d 00 00  00 00 00 00 00 00 00 00  |a ZOOM..........|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 01 14 01 00  |................|
00000080  00 00 00 00 01 14 02 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000e0  00 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
000000f0  02 14 03 00 65 7a 63 00  00 00 00 00 00 00 00 00  |....ezc.........|
00000100  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000130  00 ff 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000140  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000310  00 ff 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000320  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000360


We can see that from address 310 on there is nothing but an ff marker and nulls. We can also see that in some cases there is data between addresses 90 and 310; however, judging from the context, that data doesn't help us distinguish between radio and TV channels. So we can do two more things to make the picture clearer:

We can print just the first 144 bytes, and we can also drop the first 8 bytes as we figured out previously, and then compare again. Now it will be easier to find what is always the same for radio stations and for TV stations.

root@raspberrypi:/media/nas/public# for f in `ls x*`; do echo $f; hexdump -s 8 -n 144 -C $f; done
xaa
00000008  01 00 01 00 08 00 00 00  43 54 20 31 20 4a 4d 00  |........CT 1 JM.|
00000018  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000028  00 00 00 00 00 00 00 00  f8 c7 85 10 00 00 00 00  |................|
00000038  80 3a 11 20 00 00 00 00  08 00 00 00 cb 20 12 01  |.:. ......... ..|
00000048  01 01 00 00 00 00 00 00  01 00 00 00 00 00 10 00  |................|
00000058  07 00 00 00 43 54 20 31  20 4a 4d 00 00 00 00 00  |....CT 1 JM.....|
00000068  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000078  00 00 00 00 01 01 01 00  00 00 00 00 01 01 02 00  |................|
00000088  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000098
xab
00000008  05 00 01 00 10 00 00 00  43 52 6f 20 52 41 44 49  |........CRo RADI|
00000018  4f 5a 55 52 4e 41 4c 00  00 00 00 00 00 00 00 00  |OZURNAL.........|
00000028  00 00 00 00 00 00 00 00  60 8d 85 10 b8 83 85 10  |........`.......|
00000038  80 3a 11 20 00 00 00 00  08 00 00 00 cb 20 12 01  |.:. ......... ..|
00000048  01 41 00 00 00 00 00 00  02 00 00 00 00 00 20 00  |.A............ .|
00000058  0f 00 00 00 43 52 6f 20  52 41 44 49 4f 5a 55 52  |....CRo RADIOZUR|
00000068  4e 41 4c 00 00 00 00 00  00 00 00 00 00 00 00 00  |NAL.............|
00000078  00 00 00 00 11 10 00 00  00 00 00 00 00 00 00 00  |................|
00000088  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000098
xac
00000008  02 00 02 00 05 00 00 00  43 54 20 32 00 00 00 00  |........CT 2....|
00000018  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000028  00 00 00 00 00 00 00 00  80 cb 85 10 f8 c7 85 10  |................|
00000038  80 3a 11 20 00 00 00 00  08 00 00 00 cb 20 12 01  |.:. ......... ..|
00000048  02 01 00 00 00 00 00 00  01 00 00 00 00 00 10 00  |................|
00000058  04 00 00 00 43 54 20 32  00 00 00 00 00 00 00 00  |....CT 2........|
00000068  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000078  00 00 00 00 01 02 01 00  00 00 00 00 01 02 02 00  |................|
00000088  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000098
xad
00000008  06 00 02 00 0b 00 00 00  43 52 6f 20 44 56 4f 4a  |........CRo DVOJ|
00000018  4b 41 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |KA..............|
00000028  00 00 00 00 00 00 00 00  70 d4 85 10 60 8d 85 10  |........p...`...|
00000038  80 3a 11 20 00 00 00 00  08 00 00 00 cb 20 12 01  |.:. ......... ..|
00000048  02 41 00 00 00 00 00 00  02 00 00 00 00 00 20 00  |.A............ .|
00000058  0a 00 00 00 43 52 6f 20  44 56 4f 4a 4b 41 00 00  |....CRo DVOJKA..|
00000068  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000078  00 00 00 00 11 11 00 00  00 00 00 00 00 00 00 00  |................|
00000088  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000098
xae
00000008  03 00 03 00 06 00 00 00  43 54 20 32 34 00 00 00  |........CT 24...|
00000018  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000028  00 00 00 00 00 00 00 00  08 cf 85 10 80 cb 85 10  |................|
00000038  80 3a 11 20 00 00 00 00  08 00 00 00 cb 20 12 01  |.:. ......... ..|
00000048  03 01 00 00 00 00 00 00  01 00 00 00 00 00 10 00  |................|
00000058  05 00 00 00 43 54 20 32  34 00 00 00 00 00 00 00  |....CT 24.......|
00000068  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000078  00 00 00 00 01 03 01 00  00 00 00 00 01 03 02 00  |................|
00000088  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000098
xaf
00000008  07 00 03 00 0b 00 00 00  43 52 6f 20 56 4c 54 41  |........CRo VLTA|
00000018  56 41 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |VA..............|
00000028  00 00 00 00 00 00 00 00  70 c4 85 10 70 d4 85 10  |........p...p...|
00000038  80 3a 11 20 00 00 00 00  08 00 00 00 cb 20 12 01  |.:. ......... ..|
00000048  03 41 00 00 00 00 00 00  02 00 00 00 00 00 20 00  |.A............ .|
00000058  0a 00 00 00 43 52 6f 20  56 4c 54 41 56 41 00 00  |....CRo VLTAVA..|
00000068  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000078  00 00 00 00 11 12 00 00  00 00 00 00 00 00 00 00  |................|
00000088  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000098
xag
00000008  04 00 04 00 09 00 00 00  43 54 20 73 70 6f 72 74  |........CT sport|
00000018  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000028  00 00 00 00 00 00 00 00  58 53 85 10 08 cf 85 10  |........XS......|
00000038  80 3a 11 20 00 00 00 00  08 00 00 00 cb 20 12 01  |.:. ......... ..|
00000048  04 01 00 00 00 00 00 00  01 00 00 00 00 00 10 00  |................|
00000058  08 00 00 00 43 54 20 73  70 6f 72 74 00 00 00 00  |....CT sport....|
00000068  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000078  00 00 00 00 01 04 01 00  00 00 00 00 01 04 02 00  |................|
00000088  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000098
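
Comparing the dumps, one byte that differs consistently is the one at offset 126 (0x7e) of the record: it is 01 for the TV channels above and 00 for the radio stations. Here is a small sketch that classifies a record on that basis. channel_type is a hypothetical helper, and treating every other value as unknown is an assumption, since only these samples were checked.

```sh
#!/bin/sh
# channel_type FILE: classify one 864-byte record file by the byte
# at offset 126 (01 = TV, 00 = Radio in the samples shown above).
channel_type() {
  flag=$(od -An -v -t u1 -j 126 -N 1 "$1" | tr -d ' ')
  case "$flag" in
    1) echo TV ;;
    0) echo Radio ;;
    *) echo "unknown($flag)" ;;   # value not seen in the samples
  esac
}

# Usage, assuming the x* files produced by split are in the current directory:
for f in x*; do
  [ -f "$f" ] || continue
  echo "$f: $(channel_type "$f")"
done
```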

So now I have everything I need; I just need to get it into a nice output:

Channel number (offset 10)
Channel name (offset 16, up to 32 bytes)
Type of channel (Radio/TV) (offset 126: 01 = TV, 00 = Radio)

00000008  01 00 01 00 08 00 00 00  43 54 20 31 20 4a 4d 00  |........CT 1 JM.|
00000018  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000028  00 00 00 00 00 00 00 00  f8 c7 85 10 00 00 00 00  |................|
00000038  80 3a 11 20 00 00 00 00  08 00 00 00 cb 20 12 01  |.:. ......... ..|
00000048  01 01 00 00 00 00 00 00  01 00 00 00 00 00 10 00  |................|
00000058  07 00 00 00 43 54 20 31  20 4a 4d 00 00 00 00 00  |....CT 1 JM.....|
00000068  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000078  00 00 00 00 01 01 01 00  00 00 00 00 01 01 02 00  |................|
00000088  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000098

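#!/bin/sh

DTV_FILE='/media/nas/public/dtv_channel.txt'

# Split the file into one 864-byte record per file (xaa, xab, ...)
split -b 864 "$DTV_FILE"

for f in x*
do
  # Skip a trailing chunk that is shorter than a full record
  [ `wc -c < "$f"` -eq 864 ] || continue
  # Channel number: one byte at offset 10, printed as a decimal number
  channel_num=`hexdump -s 10 -n 1 -e '1/1 "%u"' "$f"`
  # Channel type: the byte at offset 126 is soh (01) for TV and nul (00) for radio
  channel_type=`hexdump -s 126 -n 1 -e '"%_u"' "$f" | sed 's/soh/TV/' | sed 's/nul/Radio/'`
  # Channel name: 32 bytes at offset 16, with the nul padding and the underscores removed
  channel_name=`hexdump -s 16 -n 32 -v -e '"%_u\_"' "$f" | sed 's/nul_*//g' | sed 's/_//g'`
  echo "$channel_num,$channel_type,$channel_name"
done

# Clean up the record files
rm -f x*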
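
For comparison, here is a variant that reads each record straight from the original file with od and dd, so no temporary x* files need to be created or deleted. It is only a sketch under the same assumptions as the script above (864-byte records starting at offset 0, channel number in one byte at offset 10, name in 32 bytes at offset 16, type flag at offset 126). The function names byte_at and list_channels are made up.

```sh
#!/bin/sh
# Sketch: list channels without writing one file per record.
REC_LEN=864

# byte_at FILE OFFSET: decimal value of the single byte at OFFSET
byte_at() {
  od -An -v -t u1 -j "$2" -N 1 "$1" | tr -d ' '
}

# list_channels FILE: print "number,type,name" for every complete record
list_channels() {
  size=$(wc -c < "$1" | tr -d ' ')
  count=$(( size / REC_LEN ))      # a trailing partial record is ignored
  i=0
  while [ "$i" -lt "$count" ]; do
    base=$(( i * REC_LEN ))
    num=$(byte_at "$1" $(( base + 10 )))
    case "$(byte_at "$1" $(( base + 126 )))" in
      1) kind=TV ;;
      0) kind=Radio ;;
      *) kind=unknown ;;           # value not seen in the samples
    esac
    # 32-byte name field; tr removes the NUL padding
    name=$(dd if="$1" bs=1 skip=$(( base + 16 )) count=32 2>/dev/null | tr -d '\000')
    echo "$num,$kind,$name"
    i=$(( i + 1 ))
  done
}

# Usage: list_channels /media/nas/public/dtv_channel.txt
```

Reading with bs=1 is slow on large files, but for a channel list of a few dozen records it keeps the offsets easy to follow.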