Tuesday, July 16, 2024

Ugly hack turns into beautiful way to brute force information out of badly

I ran into an issue extracting data from a badly formatted JSON response from an API, and it took me days to come up with a workaround. Every JSON library I found tries to parse the entire message and refuses to give back any part of it, even though I could see the information I wanted right there in the response debug printout.

So I came up with a solution to clip out the piece I needed by brute force. Then I realized I could just as easily return the text before the clip, after it, or both, selected with an operator. Then I realized I could repeat the operators in groups of 5 fields to chain multiple operations, each one clipping and returning values from whatever remains of the text. It is a very powerful technique. I could also add a count, so a tag has to match that many times before the search stops, and you could find the 5th block that matched the search criteria.
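To make the idea concrete, here is a minimal sketch of a single clip operation with an operator and a match count. The function name `clip`, its parameter names, and the exact meaning of "before" and "after" (everything up to the start of the cut, and everything from the end of the cut onward) are my assumptions for illustration, not the finished design. It also assumes the end tag is searched for after the start tag.

```python
def clip(text: str, start_tag: str, end_tag: str,
         start_offset: int = 0, end_offset: int = 0,
         mode: str = "cut", count: int = 1) -> str:
    """Clip text around the count-th start_tag and the next end_tag.

    Hypothetical helper: names and semantics are assumptions.
    """
    # Walk forward to the count-th occurrence of start_tag.
    pos = -1
    for _ in range(count):
        pos = text.find(start_tag, pos + 1)
        if pos == -1:
            raise ValueError(f"start tag {start_tag!r} not found {count} time(s)")

    # Assumption: the end tag must come after the matched start tag.
    end_pos = text.find(end_tag, pos + len(start_tag))
    if end_pos == -1:
        raise ValueError(f"end tag {end_tag!r} not found after start tag")

    start = pos + len(start_tag) + start_offset
    end = end_pos + end_offset

    if mode == "cut":
        return text[start:end]
    if mode == "before":
        return text[:start]
    if mode == "after":
        return text[end:]
    if mode == "both":
        return text[:start] + text[end:]
    raise ValueError(f"unknown mode {mode!r}")


sample = "aa<b>one</b>bb<b>two</b>cc"
print(clip(sample, "<b>", "</b>"))               # one
print(clip(sample, "<b>", "</b>", count=2))      # two
print(clip(sample, "<b>", "</b>", mode="both"))  # aa<b></b>bb<b>two</b>cc
```

Raising an error when a tag is missing is a choice I made for the sketch, because `str.find` returns -1 in that case and the slice would otherwise quietly return the wrong text.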

I added this to the selector field. If you start the selector with "bad_format", it overrides the format of that service for the extraction, while leaving the format unchanged for building web headers and the like.
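Here is a small sketch of how that override could be separated from the service's own format. The function name `resolve_extraction_format` and the sample selectors are hypothetical. The only things taken from my code are the `bad_format` prefix and the fact that the selector is split on dots.

```python
from typing import List, Tuple

OVERRIDE_PREFIX = "bad_format"  # the prefix the selector is checked for


def resolve_extraction_format(service_format: str,
                              selector: str) -> Tuple[str, List[str]]:
    """Return (format to use for extraction, selector fields).

    Hypothetical helper: the service format itself is never modified,
    so it can still be used for building headers.
    """
    parts = selector.split(".")
    if selector.startswith(OVERRIDE_PREFIX):
        # Drop the prefix; the rest are start tag, offset, end tag, offset.
        return OVERRIDE_PREFIX, parts[1:]
    return service_format, parts


service_format = "json"
extraction_format, fields = resolve_extraction_format(
    service_format, 'bad_format."token": ".0.".0')
# extraction_format is now "bad_format"; service_format is still "json".
# fields holds the start tag, start offset, end tag and end offset as strings.
```

One limitation follows directly from splitting on dots: a start or end tag cannot itself contain a dot, or the fields shift out of place.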

Just wanted to make a note of this so I can work on it later. I can see this becoming a function that performs multiple cutting and clipping operations with a very easy format.

import logging
from typing import Any


def extract_response_data(format: str, selector: str, response_text: str) -> Any:
    logging.info(f"Starting extraction for format: {format}")
    logging.debug(f"Parameters - Format: {format}, Selector: {selector}, \nResponse Text: {response_text}...")

    # Check whether the selector overrides the format used for extraction.
    if selector.startswith('bad_format'):
        logging.debug(f"overriding format {format} to create bad_format message")
        format = 'bad_format'
        # Fields: prefix, start tag, start offset, end tag, end offset.
        parts = selector.split('.')

    try:
        if format == 'bad_format':
            # Brute-force extraction from a badly formatted message:
            # match on the first and last tags, then offset both
            # locations and return the bit between those two points.
            '''
            turn this into a function
            This is very powerful
            allow multiple operations
            and use repeating 4 fields with an option flag
            to perform these operators:
            'before': Returns the content before the cut.
            'cut': Returns the content between the start and end tags.
            'after': Returns the content after the cut.
            'both': Returns the content before and after the cut.
            '''
            start_tag = parts[1]
            start_idx_offset = int(parts[2])
            end_tag = parts[3]
            end_idx_offset = int(parts[4])

            # The cut starts just past the start tag and ends where the
            # end tag begins; both positions are then shifted by the offsets.
            start_idx = response_text.find(start_tag) + len(start_tag)
            end_idx = response_text.find(end_tag)
            result = response_text[start_idx+start_idx_offset:end_idx+end_idx_offset]

        elif format == 'json':
            ...
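As a note for later, here is one way the repeating-fields idea from the docstring might look. This is a rough sketch, not the finished function. The names `apply_group` and `run_selector` are made up, and I am assuming each group is ordered start tag, start offset, end tag, end offset, option flag, that each group works on the output of the previous one, and that the end tag is searched for after the start tag.

```python
def apply_group(text: str, start_tag: str, start_offset: int,
                end_tag: str, end_offset: int, flag: str) -> str:
    """Run one clip operation (hypothetical helper, assumed semantics)."""
    start_pos = text.find(start_tag)
    if start_pos == -1:
        raise ValueError(f"start tag {start_tag!r} not found")
    end_pos = text.find(end_tag, start_pos + len(start_tag))
    if end_pos == -1:
        raise ValueError(f"end tag {end_tag!r} not found")
    start = start_pos + len(start_tag) + start_offset
    end = end_pos + end_offset
    pieces = {
        "before": text[:start],
        "cut": text[start:end],
        "after": text[end:],
        "both": text[:start] + text[end:],
    }
    if flag not in pieces:
        raise ValueError(f"unknown option flag {flag!r}")
    return pieces[flag]


def run_selector(selector: str, text: str) -> str:
    """Apply every group of five fields in turn to what remains of text."""
    fields = selector.split(".")[1:]  # drop the 'bad_format' prefix
    if not fields or len(fields) % 5 != 0:
        raise ValueError("selector needs groups of 5 fields after the prefix")
    for i in range(0, len(fields), 5):
        start_tag, start_off, end_tag, end_off, flag = fields[i:i + 5]
        text = apply_group(text, start_tag, int(start_off),
                           end_tag, int(end_off), flag)
    return text


response = "junk [user: name=alice; id=7] junk [user: name=bob; id=9]"
# First group skips past the first block, second group cuts out the name.
selector = "bad_format.[user:.0.].0.after.name=.0.;.0.cut"
print(run_selector(selector, response))  # bob
```

The first group uses 'after' to throw away everything up to the end of the first block, so the second group's 'cut' finds the name in the second block instead of the first.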
 
