I am trying to scrape some data with Scrapy using XPath selectors, and I would like to get rid of the unnecessary data. The XPath selector is the following:
values = response.xpath('//table[@class="pd-table"]//tbody//tr//td/text()').getall()
I know that .getall() extracts all data matched by the selector, which gives me the list below (I have added line breaks to make the list more readable; there are no line breaks in the actual output):
['Electrical design', 'PNP', 'Output function', 'normally open',
'Sensing range [mm]', '4', 'Housing', 'threaded type',
'Dimensions [mm]', 'M12 x 1 / L = 60',
'Special feature', 'Gold-plated contacts; Increased sensing range',
'Application', 'Industrial applications / factory automation',
'Operating voltage [V]', '10...30 DC', 'Current consumption [mA]', '< 10',
'Protection class', 'III', 'Reverse polarity protection', 'yes',
'Electrical design', 'PNP', 'Output function', 'normally open',
'Max. voltage drop switching output DC [V]', '2.5',
'Permanent current rating of switching output DC [mA]', '100',
'Switching frequency DC [Hz]', '700', 'Short-circuit protection', 'yes',
'Overload protection', 'yes', 'Sensing range [mm]', '4',
'Real sensing range Sr [mm]', '4 ± 10 %', 'Operating distance [mm]', '0...3.24',
'Increased sensing range', 'yes',
'Correction factor', 'steel: 1 / stainless steel: 0.7 / brass: 0.5 / aluminium: 0.4 / copper: 0.3', 'Hysteresis [% of Sr]', '3...15',
'Switch point drift [% of Sr]', '-10...10',
'Ambient temperature [°C]', '-40...85',
'Protection', 'IP 65; IP 66; IP 67; IP 68; IP 69K',
'EMC', '\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
'EN 61000-4-2 ESD', '4 kV CD / 8 kV AD', 'EN 61000-4-3 HF radiated', '10 V/m',
'EN 61000-4-4 Burst', '2 kV', 'EN 61000-4-6 HF conducted', '10 V', 'EN 55011', 'class B',
'\n', 'Vibration resistance', '\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
'EN 60068-2-6 Fc', '20 g (10...3000 Hz) / 50 sweep cycles, 1 octave per minute, in 3 axes',
'\n', 'Shock resistance', '\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
'EN 60068-2-27 Ea', '100 g 11 ms half-sine; 3 shocks each in every direction of the 3 coordinate axes',
'\n', 'Continuous shock resistance', '\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
'EN 60068-2-27', '40 g 6 ms; 4000 shocks each in every direction of the 3 coordinate axes',
'\n', 'Fast temperature change', '\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
'EN 60068-2-14 Na', 'TA = -40 °C; TB = 85 °C; t1 = 30 min; t2 = < 10 s; 50 cycles',
'\n', 'Salt spray test', '\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
'EN 60068-2-52 Kb', 'severity level 5 (4 test cycles)', '\n',
'MTTF [years]', '1642', 'UL approval', '\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
'UL Approval no.', 'A001 ', '\n', 'Weight [g]', '27.7',
'Housing', 'threaded type', 'Mounting', 'flush mountable',
'Dimensions [mm]', 'M12 x 1 / L = 60',
'Thread designation', 'M12 x 1',
'Materials', 'brass white bronze coated; sensing face: PBT orange; LED window: PEI; lock nuts: brass white bronze coated',
'Display', '\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
'switching status', '4 x LED, yellow', '\n',
'Items supplied', '\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
'lock nuts: 2', '\n', 'Pack quantity', '1 pcs.',
'Connection', 'Connector: 1 x M12; Contacts: gold-plated']
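The unnecessary entries in this list are the whitespace-only strings. As a minimal sketch of removing them in plain Python (the names raw_values and drop_blank are hypothetical, and the sample is only a short excerpt of the output above):

```python
# Hypothetical sample: a short excerpt of the list returned by .getall()
raw_values = [
    'Housing', 'threaded type',
    'EMC', '\n \n\n',
    'EN 55011', 'class B',
    '\n',
    'UL Approval no.', 'A001 ',
]


def drop_blank(values):
    """Strip every string and discard the ones that are only whitespace."""
    return [v.strip() for v in values if v.strip()]


cleaned = drop_blank(raw_values)
print(cleaned)
```

Note that this also throws away the blank entries that follow group headings such as 'EMC', so the cleaned list no longer alternates strictly between key and value. It may also be possible to filter inside the selector with an XPath predicate such as text()[normalize-space()], but that is untested against this page.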
Within the list, each pair of consecutive items forms one element (a name and its value), which I want to format similarly to the sample below:
['Electrical design' : 'PNP',
'Output function' : 'normally open',
....
'EMC' : 'EN 61000-4-2 ESD' : '4 kV CD / 8 kV AD' , 'EN 61000-4-3 HF radiated' : '10 V/m',
...
'Connection' : 'Connector: 1 x M12; Contacts: gold-plated']
I hope I was able to explain myself clearly. I need help transforming the raw data into a formatted list.
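One possible reading of the target format is a dictionary in which group headings such as 'EMC' map to a nested dictionary. The sketch below works on the flat list under assumptions inferred from the output above, not from the page's HTML: a name followed by a whitespace-only string opens a group, and a stray whitespace-only string closes it. The function name to_nested and the sample data are hypothetical.

```python
def to_nested(values):
    """Pair up a flat [name, value, name, value, ...] list into a dict.

    Assumptions (inferred from the sample output only):
    - a name followed by a whitespace-only value is a group heading;
    - a whitespace-only string where a name is expected ends the group.
    """
    result = {}
    target = result  # the dict that currently receives the pairs
    items = iter(values)
    for key in items:
        if not key.strip():
            target = result  # stray blank entry: back to the top level
            continue
        value = next(items, "")
        if value.strip():
            target[key.strip()] = value.strip()
        elif target is result:
            group = {}  # blank value at top level: start a nested group
            result[key.strip()] = group
            target = group
        else:
            # Lone entry inside a group (e.g. 'lock nuts: 2'); the blank
            # string after it is taken as the end of the group.
            target[key.strip()] = ""
            target = result
    return result


# Hypothetical excerpt of the scraped list
sample = [
    'Housing', 'threaded type',
    'EMC', '\n \n\n',
    'EN 55011', 'class B',
    '\n',
    'Items supplied', '\n \n\n',
    'lock nuts: 2', '\n',
    'Pack quantity', '1 pcs.',
]
specs = to_nested(sample)
print(specs)
```

Because names such as 'Electrical design' occur twice in the full list, a dictionary keeps only the last occurrence; here the repeated entries carry the same value, so nothing is lost.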
['Electrical design' : 'PNP', 'Output function' : 'normally open looks like a dictionary data structure (because of :), but you include [ (which points to a list) there, so it's a bit confusing what you really want. – pavelsaman Mar 25 '21 at 21:02