
Question


This problem is designed to be solved without any packages besides those included in base Python 3.6/3.7. However, you are welcome to use and install any external packages that you'd like, but you must integrate them into the existing [pipenv](https://docs.pipenv.org/) configuration set up in the repo.

## Problem Introduction
A client has an older CRM where every type of csv, tsv, or delimited export that they make of their data cannot be ingested via normal csv-reading methods. The reason for this is that the fields themselves often contain problematic characters like new line characters (`\n`). Ordinarily this would not pose a problem, because a proper export would surround each field with quotes (and escape any quotes inside the field with another quote), which would then enable the files to be read by any csv-reading package; unfortunately this system is unable to do that.

The client settles on the following solution: they will export their data with a multi-character delimiter like `{[]}` that they have confirmed is not present in any field in their data (so it is guaranteed to be able to split fields), and then they will tell us the number of fields per record in the data they are exporting.

Your goal is to write a program in Python 3.7 that takes command-line arguments for the filename, the delimiter string, and the number of fields per record, and then prints to the console tuples representing each record in the data export.

### Example

Here is a sample dsv with a delimiter of `{[]}` and a field count of 4 (this file `sample.dsv` is found in this repo):

```
Name{[]}Age{[]}Welcome Email{[]}Email Address
Bob{[]}24{[]}Hello Bob! I'm so happy to have you on board. Looking forward to meeting you soon. Best, Hendrik{[]}bob@bob.com
Ned{[]}49{[]}Greetings Ned - Hope this finds you well. I can't believe it's been... 10"YEARS since we last saw each other. Hope all is well with you and the kids at Winterfell. Robert P.S. Want to be hand of the King?{[]}ned@stark.com"
```

After implementing the `parse_file.py` python script, running it (in the same directory as `sample.dsv`) with the following options should create the following output:

```
$ python3 parse_file.py -k 4 -d {[]} sample.dsv
('Name', 'Age', 'Welcome Email', 'Email Address')
('Bob', '24', "Hello Bob! I'm so happy to have you on board. Looking forward to meeting you soon. Best, Hendrik", 'bob@bob.com')
('Ned', '49', 'Greetings Ned - Hope this finds you well. I can\'t believe it\'s been... 10"YEARS since we last saw each other. Hope all is well with you and the kids at Winterfell. Robert P.S. Want to be hand of the King?', 'ned@stark.com"')
```

Note that despite the file being 20 lines long, the parser was able to read it into the 3 tuple records of 4 fields each. Note also that quotes (which are often used by csv-readers to indicate that delimiters inside them should be ignored) should not be used in this case: the multi-character delimiter should be the only thing that determines when a field ends.
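As a small aside (an illustration, not part of the original spec): the standard-library `csv` module is no help here, because it only accepts single-character delimiters, whereas plain `str.split` handles a delimiter string of any length:

```python
import csv
import io

# csv.reader validates the dialect up front and rejects multi-character delimiters.
try:
    csv.reader(io.StringIO("a{[]}b"), delimiter="{[]}")
except TypeError as exc:
    print(exc)  # "delimiter" must be a 1-character string

# str.split, by contrast, splits on an arbitrary substring.
print("Bob{[]}24{[]}bob@bob.com".split("{[]}"))  # ['Bob', '24', 'bob@bob.com']
```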
## Detailed Problem Specification

Let us first define some variables:

* `delim` - the delimiter string (can be more than one character)
* `K` - the number of fields per record (always a nonzero integer)
* `N` - the number of records in the file (always a nonzero integer)

### Input File Specification

Then the following will be true of the input file:

* The file will contain exactly `N*(K-1)` instances of `delim` (meaning that the file has `N` records of `K` fields each).
* Every record will end in a new line (`\n`) character (but not every new line character means the end of a record).
* Assume that the final field in each record was pre-selected such that it does NOT contain any new line characters (if it did, it would be impossible to know when the next record began). This means that you can consider the `K`-th field to be everything up to the new line character after having read `K-1` delimiters while building the record.

## Program Specification

### Input

The program should operate off of 3 command line arguments:

* `-d`: the delimiter `delim` (multi-character delimiters must be supported)
* `-k`: the integer `K`, the number of fields per record
* last argument: the filename of the file to read (`sample.dsv` in the above example)

A valid command line call would be:

```bash
python3 parse_file.py -k 4 -d {[]} sample.dsv
```

or, if you've set up pipenv:

```bash
pipenv run python parse_file.py -k 4 -d {[]} sample.dsv
```

### Output

Output to the console a series of `N` Python tuples (you can just use the built-in tuple's default `__repr__` method by printing it). Each of the printed tuples should be of length `K`, representing the fields in each record in the file according to the spec above.

### Bonus Points

The following are some optional optimizations that you might consider including in your solution (a sketch that touches on them follows this list):

* Write your parser in such a way that you never read the whole file into memory, but rather keep only a small buffer as you stream the file line by line and output the tuples as you iterate through it. This is hugely important for files that may be many GB in size!
* Abstract the file parsing into its own method or class so that the actual printing of the tuples can be done in a separate part of the code from the parsing that generates them. This would allow you to do something else with the tuples (e.g. write them to a DB) without having to change your main parsing code.
* Throw helpful errors (or direct the user to a `-h` option) if the command line arguments provided violate the input argument assumptions in either number or types.
* Raise an exception if the file violates any of the assumptions in the Input File Spec, especially in the case that you have already read `K-1` delimiters while parsing a record and can't reach the end of a line without running into more delimiters.
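To make the spec concrete, here is a rough sketch of one possible `parse_file.py`. It is only an illustration under the assumptions stated in its comments; the generator name `parse_records` and the exact error messages are choices of this sketch, not requirements of the assignment.

```python
"""A minimal sketch of one possible parse_file.py (not the reference solution).

Only the -d/-k flags and the positional filename come from the spec; the helper
name parse_records and the error messages are assumptions of this sketch.
"""
import argparse


def parse_records(lines, delim, k):
    """Yield K-field tuples from an iterable of lines, buffering one record at a time."""
    buffer = ""
    for line in lines:
        buffer += line
        # Per the input file spec, a record is complete once it contains K-1
        # delimiters and reaches a new line (the final field has no new lines).
        if buffer.count(delim) >= k - 1 and buffer.endswith("\n"):
            fields = buffer[:-1].split(delim)
            if len(fields) != k:
                raise ValueError(
                    "Malformed record: expected {} fields, got {}: {!r}".format(k, len(fields), fields)
                )
            yield tuple(fields)
            buffer = ""
    if buffer:
        raise ValueError("File ended in the middle of a record: {!r}".format(buffer))


def main():
    parser = argparse.ArgumentParser(description="Parse a delimiter-separated export into tuples.")
    parser.add_argument("-d", dest="delim", required=True, help="the delimiter string (may be multi-character)")
    parser.add_argument("-k", dest="k", type=int, required=True, help="the number of fields per record")
    parser.add_argument("filename", help="the file to read (e.g. sample.dsv)")
    args = parser.parse_args()

    if args.k < 1:
        parser.error("-k must be a positive integer")

    # Stream the file line by line so only one record's worth of text is held in memory.
    with open(args.filename) as f:
        for record in parse_records(f, args.delim, args.k):
            print(record)


if __name__ == "__main__":
    main()
```

Because the parsing lives in a generator that accepts any iterable of lines, the printing loop in `main` could be swapped for something else (for example, a database writer) without touching the parser, and the file is never read into memory all at once.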

 
