Instructor: David Schuff, Section 003

Weekly Question #5: Complete by March 9, 2017

Leave your response as a comment on this post by the beginning of class on March 9, 2017. Remember, it only needs to be three or four sentences. For these weekly questions, I’m mainly interested in your opinions, not so much particular “facts” from the class!If you sign in using your AccessNet ID and password you won’t have to fill in the name, email and captcha fields when you leave your comment.

Here is the question:

In class we discussed several examples of how data can be stored in different formats in two different sources (i.e., a date can be represented mm/dd/yyyy or dd/mm/yyyy). Give your own example of a scenario, other than the ones we discussed in class, where data could be represented two different ways. How would you reconcile this conflict when combining the data sources?

5 Responses to Weekly Question #5: Complete by March 9, 2017

  • In the building I live in, we were recently given recycling bins. When we went to go pick up the bins, there were 2 lines. I was in the first and my neighbor was in the second. When we picked up our bins, we were asked our room number. The person putting in the information in my line asked my room number and I said 513 North. This information was put in one cell. When my neighbor was asked, “514” was put in one cell and “North” was put in another. I think a way this problem can be solved is being consistent with data entry. The bin distributors should have decided which way they were going to input the data before they began.

    • Ami, I had a similar experience in my own apartment building. I have bedroom 322A, and my roommate is 322B. Most of the time, the number is kept whole and we are treated like we live separately; separate bills, mail, packages. However, when my apartment building made an online portal and uploaded our leases, it became clear that they stored leases only by room number. My roommate has access to both of our leases on his account, while I cannot access either of our leases. This showed the landlord’s lack of forethought with data entry- they should have been storing leases by room and bedroom number.

  • An example of data being represented in two different ways in two different sources is the different file formats that a file can take. For example, a document on one computer can be saved in the .docx format; however, on a different computer, that same document can be saved in the .pdf format. In order to reconcile this conflict, we would need to convert the .docx document to .pdf or vice versa in order to match the file format on two different sources.

  • Every time I type in my phone number online I stop and decide whether I want to enter it like “xxx-xxx-xxxx” or leave out the dashes. I do not know if it makes a difference for the people using the information, but it would look a lot nicer if every phone number in the database had dashes. One way some companies deal with this is by having three boxes and once you type in the area code the cursor automatically jumps to the next box, etc. This way all phone numbers are displayed uniformly in the database.

  • One classic example I refer to when discussing the various ways the same information can be stored is the German Calendar example. In the United States, our calendar week runs from Sunday to Saturday. The German calendar week, on the contrary, runs from Monday to Sunday. The root of this difference comes from both religious and organizational backgrounds. I feel like the best way to reconcile this difference is to pick a format and run with it.

  • In research that I did last summer, I was working with two different databases of all properties in Philadelphia. Since some of the info was incomplete in both, I needed to create boolean statements comparing the zip codes to see if they matched. However, in one database they were stored as XXXXX-XXXX (the zip+4 number) and the other it was just the first five digits. Before I could compare, I needed to split the cells of the zip+4 codes into two different columns. That way I was able to make an equal comparison and verify/discover any errors in the data.

  • In the United States, people use the 12-hour clock to tell time; however, in some countries people use the 24-hour clock, or military time, as a way to tell time. When combining the data, it depends on which the preference of the receiver to know which clock to convert it to. In the U.S, one will need to assign each time on the 24-hour clock to the specific time in the U.S. For example, “1:00” (military time) will need to be changed to “1:00 AM” (12-hour clock).

  • One type of data that could be entered multiple ways is height. In the U.S., people typically use feet and inches as opposed to just inches. (I am 5 foot 1 inch vs. I am 61 inches tall) If the database is for looking up height, I would use a separate column for feet and inches. If it was to find relationships based on height, I would use inches. To combine the two different methods, you could create a new calculated column to change one of the methods.

  • An example of data being entered into databases in different ways is the use of the phone number in databases. Some have a field with no dashes and some have dashes. It may not be a problem of “incorrect” data being used but it could end up being an issue if the person entering the data used dashes and goes over the character limit for the field and not all data is entered or the database rejects the data because it is not presented in the normal way for the database. A solution to this issue is to keep consistency throughout the database and have a normalized form of entry. With query commands, one could change the format of the field to allow what ever the agreed upon entry would be.

  • I noticed where I came from, Asia, people often write address in the format No.3, alley 134, ABC street, etc. However, in the U.S, addresses are written in the format of address 1 which is the street address and address 2 is apartment number or such. Therefore, when reconciling two different databases from Asia and the U.S, we can combine address 1 and address 2 of the U.S database into one address, or separate the original address in Asian format into two addresses. However, the latter seems to be more difficult and complicated.

  • The first example that comes to mind is abbreviated address types such as St. or Rd.. If there are not rules in place when the data is entered, one user could input St. and another could input Street. Although both formats of the data are correct, the actual characters used do not match. To reconcile this I would suggest choosing one format abbreviated vs. full text and converting the others to meet that format.

  • I worked on a project using public databases to try to identify potentially profitable properties throughout Philadelphia, namely those that might soon be foreclosed. We had to deal with zip codes being stored in a number of ways, either as a 5 digit number, the full 9 digits of the zip code, and a mix, where the first 5 and the last 4 were hyphenated. We had to edit each database separately before doing any work between them. We decided the easiest method was to simplify down to the first 5 digits, so that we didn’t have any missing data. We created a second column to store the final four digits for the properties where we were given the full zip code, but we hid the column as it did not help us with our analysis.

  • An example of data being entered into databases in different ways is measurements using different systems. In the US, we use the Customary system while is most other countries they use the Metric system. The product will not change in weight but the numbers recorded will be different due to the different units of measurement. To solve this issue we could make two columns and create and a formula to automatically convert the Customary to Metric and Metric to Customary. Both columns would be important to have because depending on who is looking at the data, they may not understand both units of measurements.

    • Dean, I agree with your opinion of databases being used in different ways according to units of measurement. When I annually visit my family in Greece I continually run into the conversion issue between Metric and Customary measurements which each country used by standard. If there were a database combining the two types of measurements, someone from the United States would rather see the two tables which have the original Customary along with the converted Metric data to make it easier to interpret and conduct analysis. However someone in Greece wouldn’t know what the values we consider standard mean, two tables of Metric and converted Metric data would be ideal for anyone in that circumstance.

  • An example of how data can be stored in two different formats are shoe sizes. I noticed that depending on the company or brand it will list shoes in either list it in US standard or EU standard. But different countries have a slightly different way of measuring the shoe size. In the UK and Ireland, shoe size is measured with barleycorn. I think the best way to fix this it to have everyone agree to one standard sizing measurement.

  • One example of data being entered into the database differently is the way different countries represent time. There is the difference between how we enter the 24-hour clock and the 12-hour clock. For example, 22:00pm and 10:00pm both mean ten o’clock. Some countries utilize the 24-hour clock and some others use 12-hour clock. It is important to enter the correct time in the database because it can create confusion. Every data should be consistent. If we enter data using the 24-hour clock, then we should stick with using that throughout.

  • An example of data being represented in two different ways is credit card numbers when you’re making online transactions. Some require the space/dash in between the numbers, such as, 4444-4444-4444-4444 and then some require no space, such as, 44444444444444444. I think the one with the space/dash is easier to read so I would reconcile this conflict by creating a database where the card number has to be inserted into 4 separate text boxes with a dash separating each text box. That way, it more so represents what the numbers on one’s credit card will look like instead of being a long line of 16 numbers.

  • One example of data being shown in two different ways, is when people use fahrenheit to celsius to say what the temperature is. 32 degrees fahrenheit is the same temperature as is 0 degrees celsius. It can confuse people, because if you say its 50 degrees, you don’t know if they mean celsius or fahrenheit, which is an extreme difference in temperature. To fix this I think everyone should learn and use the metric system.

  • An example of data being store in two different ways is the street name. In some databases you could type in the whole street name or you could shorten it into abbreviations. This is well use in my GPS. I think the best ways to reconcile these two data is we had to specified the most common ways words can be shorten such as “road” to”rd” and “North” to “N”.

  • An example of data being represented in two different way is regular time (12 hour rounds) or military time. This summer at my internship some employees previously served in the military and used that timing system to set up meetings or give due dates. I think 12 hour rounds is easier because it is what I have grown up using, and is quicker for me to read. I think someone’s preference for one over the other comes from their experience and their environment, if they are used to using military or regular time. I would reconcile the data by putting it into 12 hour time zones since the majority of people use that system.

  • One example of data being represented in two different ways is different forms of time keeping. While most use a 12 hour clock consisting of AM and PM some prefer to use a 24 hour clock, mainly in the army. When combining this data I would first find out what the preference is and then either turn any hour past 12 on the 24 hour clock to its corresponding PM time (example: 13:00 = 1:00 PM). This can also be inversely done by making any PM hour equal its corresponding 24 hour time (example: 1:00 PM = 13:00).

  • An example of data being in two separate formats arose at my summer internship because the person before me had created ways of storing information and data in excel sheets much differently than I thought made sense. Therefore, as part of my job, I had to reconcile, confirm, and reformat the information that already existed, make sense of the pieces that just did not make sense, and create an efficient way to read lots of information successfully. The end result was a data cube of sorts with the information in a more confined and updated format.

  • One example I can think of is how a company records inventory. So if a company records inventory with the UPC at one place, and at another place assigns a unique product code it represents the same inventory but with different identifiers. One way to resolve the discrepancy is with a lookup table, and link the UPC over to the unique product code.

  • An potential example of data being stored differently between two sources would be phone numbers. Some methods utilize parentheses as well as dashes to separate the different aspects of the phone number. In one data source, a phone number could be stored as 1234567890. In other sources, it could be stored as either (123) 456-7890 or 123-456-7890. To fix this, you could use the special phone number format function to change all of the straight 10 digit phone numbers into the correct format.

  • An example of data being represented in two different sources is temperature. In the United States, they use Fahrenheit while in other parts of the world like Europe they use Celsius. In order to reconcile this conflict will need to choose one (Fahrenheit or Celsius) convert the temperature and have all the data in that unit of temperature.

  • An example of data represented in more than one way is when purchasing online clothing. Companies from different countries will use their sizing metrics that are different from the U.S. It gets confusing as to what size to order, even the size chart is complicating to read because it isn’t always exact. I think brands should have country codes you can input into the website to convert the entire website to U.S. metrics to make it more concise.

  • An example of data being represented in two different sources is the difference of dollars and euros. If you are spread-sheeting the revenues of companies that operate both in the US and Europe some may have entries pertaining to their overall revenue in Euros as compared to dollars. The best way to solve this issue would be to simply choose which way the company would like to to recognize their revenue. Personally I think the standard US dollar would be best because I feel most people will be able to understand that, as most people do not know how much an amount in Euros equals in US dollars.

  • An example of how data can be stored in two different formats is a phone number. Some people store just the numbers, and some people store it as xxx-xxx-xxxx. To make it all uniform, I would store them as xxx-xxx-xxxx. That way, its easier to read and the data input would all be uniform.

  • An example of different data representation is when using our system of time versus military time. We would say 1:00 PM while military time would represent that same piece of data as 13:00. If this was in a database it would create inconsistencies.

  • An example of this scenario could be the length of something. Since the United States doesn’t use the metric system, our numbers will be in a different format than those of other countries. For example, we may put the distance of something in miles, as another country would have the distance in kilometers, and our data wouldn’t be consistent between databases.

  • One scenario that I think applies would be store locations. Sometimes store locations may be listed by a number according to the order that they appear on a separate sheet. The majority of the time, they will appear as the actual location. For example, you might see “1” and “Philadelphia”. Store 1 may mean that that store is in Philadelphia, but when you look it up, the results will not show that so you need to make them equal. Reconciling this data would happen when you actually set Store Location “1” equal to Store Location “Philadelphia”, so that when you look up the results, they will be equal.

  • One example of data being represented two different ways is measuring the height of a basketball player. In the US a player is measured by feet and then followed by inches, such as 6 foot 2 or 6″2. Players in international leagues would be measured in meters and increments of a meter. An example would be a player who is 1 meter and 92 centimeters or 1.92. To solve that would mean either the US would have to adapt to the metric system or the rest of the world adapting to the imperial system. Otherwise a player’s height could get lost in translation for those who don’t know the other.

  • One example of data being stored in two different sources is having data in the American/Standard Unit measurement and Metric measurements. Feet and meters are finite measurements that can be converted into each other with a simple calculation. When combining these data sources it could be possible to implement a data conversion of one measurement to the other. So if a data source is coming from Europe, these measurements can automatically be converted to the American system by having a function that recognizes metric notations (cm, m, km ect) and convert it to a relative American measurement.

  • An example of data being represented in different ways can be shown on a vehicles speedometer because as a vehicle picks up speed, a measurement is displayed. Most cars have two different sets of numbers one being the MPH and the other KM/H. They’re both being recorded and displayed accurately at the same time but can be used to measure different units of speed. Some countries have their speed limit signs in strictly KM/H. A simple unit conversions chart can resolve an issue in reading the data.

  • An example of this is how Vietnamese names may be stored differently in a US database and a Vietnamese database. A Vietnamese name can be three or four words long. A first name can comprise two separate words. However, many US online forms, which reflect their databases, limit first and last names to one word. This leads to potential mismatch between the first name column in a Vietnam-based database and one based in the US. One way to resolve the issue is to use a certain function to take out the first word of each two-word value, “leaving the main first name” intact. This is not ideal, because some people reverse their two-word names in confusion when trying to fill out US forms. If high accuracy is required, programs or actual humans may have to refer to Vietnamese naming conventions to correct those entries before removing the first word.

  • A good example of data being represented in two different formats are measurements. For example, while not excuse to one or the other, Europe generally tends to favor measuring in Kilometers rather rather than in miles like in America; but it all comes down to the basic idea of distance. For another example. I could tell someone to meet me at an establishment one mile away, but in Europe, or really anywhere in general, people could also say they would like to meet at the area in the same distance, but instead say 1.6 kilometers.

  • Part of my research last summer involved collection of data on consumer goods. This included items from washing machines to cars and trucks. After collection of the data, I realized that the format of data for a truck and model was different then the format of a car and model. For example, a truck may have been stored in the data as “Ford”, “F150”, while many cars were stored in the following format: “SubaruImpreza”. In order to reconcile this format inconsistency, I wrote a script that would recognize when there was a second capital letter in the cell and then break the cell into two cells from there. For example, it would break “SubaruImpreza” into “Subaru” “Impreza”.

  • One example of how data can be stored in different formats in two different sources is entering phone numbers online. If there are no instructions then it is hard to decide whether they want your phone number to have dashes or not. A way to reconcile this conflict is by providing the dashes where people would enter their number.

  • An example of data being presented in two different ways would be when looking at roman numerical system. Similar to military time the roman numeral system presents values in a different form (III = 3, V=5 ,X=10,etc.). This system is still used in many books where the book contains a pre-face which authors number that section of the book with roman numerals so the reader doesn’t get confused as to where the actual story starts.

  • One case of information that can be entered/put away in two distinctive ways is true/false or male/female reactions. You can either compose the entire word, or utilize normal condensing, for example, T/F and M/F. To accommodate this, you would need to ensure that the two distinctive ways level with each other, for example, T=True or M=Male. This is additionally like utilizing shortened forms for different words, for example, Mon for Monday, Tues for Tuesday, et cetera.

  • An example of data in different formats was during my internship when describing products of origin from different plant locations in our WMS. In one column the format was PalletID-PlantLocation-ManufacturingDate while in another column it was ManufacturingDate-PalletID-PlantLocation. While it all was the same information, the entry was in a different order and therefore redundant. Once I understood the similarities and was confident that there wasn’t any alternative meanings, I ended up deleting one of the columns along with the other unnecessary information in order to continue on with my analysis of inventory optimization.

  • a way that data can be stored differently is measurement. The metric system is used in other parts of the world, but not in the states. When I go to the gym here, the plates have 45 on them (lb), but in Asia, the plates are 22 (kg). In competitions and records in different parts of the world, the same data may be presented in different numbers

  • The model and the sizes data differences. In amazon, different clothes and shoes have different sizes in accordance to the shape. Moreover, in the electronics section, the model type would change in accordance to the model differences.

  • One example of how data can be stored in different ways is your TUID with dashes after the 3 numbers and without the dashes.

  • An example of how data can be stored in different formats in two different sources is the time difference between west coast and east coast in the US. I often have conference calls with some professionals on the west coast such LA, Santiago, which are 3 hours behind Philadelphia. So it’s important for me to specify the time for different locations and avoid the error and misunderstanding we have in the data.

Leave a Reply

Your email address will not be published. Required fields are marked *

Where and when do we meet?
Alter Hall 232
11:00 - 12:20 Tuesdays and Thursdays
Office Hours
David Schuff (instructor):
10:00-11:00, Tuesdays and Thursdays
Speakman Hall 207G and email (see my site)

Nodir Zakhidov (ITA):
Monday: 1:00-2:00
Wednesday: 1:00-2:00
Speakman Hall 207 and email (see his site)